Abstract: This paper is concerned with the detection of multiple change-points
in the joint distribution of independent categorical variables. The
procedures introduced rely on model selection and are based on a penalized
least-squares criterion. Their performance is assessed from a nonasymptotic
point of view. Using a special collection of models, a preliminary estimator
is built. According to an existing model selection theorem, it satisfies an
oracle-type inequality. Moreover, thanks to an approximation result
demonstrated in this paper, it is also proved to be adaptive in the minimax
sense. In order to eliminate some irrelevant change-points selected by that
first estimator, a two-stage procedure is proposed, which also enjoys an
adaptivity property. Besides, the first estimator can be computed with a
complexity only linear in the size of the data. A heuristic method makes it
possible to implement the second procedure quite satisfactorily with the same
computational complexity.
1. Introduction

Let $Y_1, Y_2, \ldots, Y_n$ be independent random variables taking values in the
finite set $\{1, \ldots, r\}$, where $r$ is an integer and $r \ge 2$, and let $s$
be the joint distribution of $(Y_1, Y_2, \ldots, Y_n)$. Assume that
$\{1, \ldots, n\}$ can be partitioned into intervals such that all the $Y_i$ with
indices $i$ in a same interval follow the same law. Then $s$ is said to have
change-points located at the beginning of each interval, 1 excluded. In this
paper, our aim is to detect change-points in $s$, using no a priori information
on their number. A typical application is the DNA segmentation problem, for
which the review [?] by Braun and Müller may serve as an introduction. Indeed,
the n-tuple $(Y_1, Y_2, \ldots, Y_n)$ provides a model for the successive bases
along a DNA sequence of length $n$, when coding the set of bases
{Adenine, Cytosine, Guanine, Thymine} by $\{1, \ldots, 4\}$ for instance. Thus,
beyond the theoretical properties of the statistical procedures, special
attention must be paid to their computational complexity, owing to the length
of sequences such as DNA ones.

Several methods based on a penalized criterion, with a penalty typically
increasing with the number of change-points, have been proposed for the
statistical problem under consideration. Braun, Braun and Müller present in [?]
such a procedure, based on a penalized quasi-deviance criterion, and prove
consistency results for the estimation of the change-points and the true number
of change-points. Nevertheless, computing their estimator, though made easier by
dynamic programming, remains quite costly, with $O(n^3)$ computations, or
$O(n^2 D_{\max})$ if an upper-bound $D_{\max}$ is imposed on the number of
change-points. Lebarbier and Nédélec also study penalized criteria in [17], one
based on least-squares, the other on maximum likelihood. Their procedures are
based on the model selection principle developed by Birgé and Massart in
various papers, such as [5]. Thus they adopt a wholly different point of view
from that of Braun et al.: the estimators studied in [17] are nonparametric and
are proved to satisfy a nonasymptotic oracle-type inequality, for an adequate
choice of the penalty. But, when considering all possible configurations of
change-points, these procedures suffer from the same computational complexity
as that of Braun et al. In order to significantly reduce the computational
time, the CART-based procedure proposed by Gey and Lebarbier in a Gaussian
regression framework (cf. [?]) can be adapted to the framework considered here,
as illustrated in Chapter 7. In the best case, the number of computations falls
to only $O(n \ln(n))$. Unfortunately, apart from the oracle-type inequality
given in [17], theoretical properties of that hybrid procedure seem difficult to
establish. Adopting the same approach as in [17] or [?], Durot, Lebarbier and
Tocquet propose in [13] quite a general framework for estimating $s$ relying on
a penalized least-squares criterion, where the choice of the penalty is
supported by an oracle-type inequality. As a particular case, Durot et al.
recover one of the change-point detection methods proposed in [17]. They
complete the study of its performance with an improved oracle-type inequality
and an adaptivity result in the minimax sense. Let us also mention some other
methods, not based on penalized criteria, that enjoy an interesting
computational complexity. They are not supported, however, by theoretical
results. Fu and Curnow propose in [14] an estimator based on maximum
likelihood, imposing a constraint on the minimal lengths of the segments to
prevent overfitting. According to [10], it can be implemented with a
computational complexity only linear in the size of the data. Szpankowski,
Szpankowski and Ren study in [20] a procedure inspired by Information Theory.
It also has a linear complexity, which results from the splitting of the
sequence into blocks of a prescribed length.

Following the work presented in [?], [13] and [?], we propose in this paper
two statistical procedures based on a penalized least-squares criterion, using
the same model selection principle. Each estimator we build is piecewise
constant on a partition of $\{1, \ldots, n\}$. If the distribution $s$ is
piecewise constant, then the partition associated with the estimator provides
an estimate of its change-points. We first study an estimator based on a special
collection of models in correspondence with the partitions of
$\{1, \ldots, n\}$ into dyadic intervals only. That collection of models
satisfies two important properties. On the one hand, it has been chosen for its
potential qualities of approximation. They have been suggested by a theorem due
to DeVore and Yu (cf. [?]) about the approximation of functions in Besov spaces
by piecewise polynomials. Adapting their proof to our framework, we prove that
our collection of models has indeed good approximation qualities with respect
to Besov bodies, some discrete analogues of balls in a Besov space defined in
this article. On the other hand, the number of models per dimension is much
lower for that collection than for the analogous one associated with all the
partitions of $\{1, \ldots, n\}$ into intervals, also called exhaustive
collection in [17] and [13]. So no extra logarithmic factor appears in the
oracle-type inequality satisfied by our first estimator. The conjunction of
both properties makes it possible to prove an adaptivity result in the minimax
sense over Besov bodies. Notice that, because of those two interesting
properties, a similar collection of models has lately been used by Birgé
(cf. [?] and [?]) and Baraud and Birgé (cf. [?]) for estimation by model
selection in various statistical frameworks. About our first procedure, we must
underline that considering such a reduced collection of partitions also happens
to reduce the computational complexity to the sought-after linear complexity.
It should also be noted that the hypothesis that $s$ is piecewise constant is
not used to derive any result. Therefore, whatever $s$, that first procedure
still provides an interesting estimator of $s$. For the detection of
change-points, although it does detect some relevant ones, it also selects some
less significant ones, due to the nature of the selected partition. That is why
we propose the following hybrid procedure. A preliminary stage consists in
using part of the data to select a partition into dyadic intervals with the
previous procedure, which will henceforth be called preliminary procedure.
During the second stage, the rest of the data is used to select, among the
rougher partitions built on the previous one, the one minimizing a penalized
least-squares criterion. The resulting hybrid estimator also enjoys an
adaptivity property, similar to that of the first procedure, up to a $\ln(n)$
factor. Moreover, in practice, it can also be implemented quite efficiently
with a linear complexity.

The paper is organized as follows. In the brief Section 2, we describe the
statistical framework and introduce notation used throughout the paper. The
next two sections are devoted to the theoretical study of the preliminary
estimator and of the subsequent hybrid estimator. The performance of these
procedures is illustrated in Section 5 through a simulation study. In
particular, we discuss there the practical choice of the penalty constants. The
paper ends with the proof of the approximation result needed to derive the
adaptivity properties of both estimators.

2. Framework and notation 
2.1. Framework 

We observe $n$ independent random variables $Y_1, \ldots, Y_n$ defined on the
same probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and with values in
$\{1, \ldots, r\}$, where $r$ is an integer and $r \ge 2$. Moreover, we assume
that $n$ is a power of 2 and write $n = 2^N$. The distribution of the n-tuple
$(Y_1, \ldots, Y_n)$ is defined as the $r \times n$ matrix $s$ whose $i$-th
column is

$$s_i = (\mathbb{P}(Y_i = 1) \; \ldots \; \mathbb{P}(Y_i = r))^T, \quad \text{for } 1 \le i \le n.$$

Observing $(Y_1, \ldots, Y_n)$ is equivalent to observing the random
$r \times n$ matrix $X$ whose $i$-th column is

$$X_i = (\mathbb{1}_{Y_i = 1} \; \ldots \; \mathbb{1}_{Y_i = r})^T, \quad \text{for } 1 \le i \le n.$$

It should be noted that the distribution $s$ to estimate is in fact the mean of $X$.
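
To make the encoding concrete, the following sketch builds the indicator matrix $X$ from a sample. It is only an illustration: the function name `one_hot_matrix`, the use of NumPy and the sample `y` are assumptions, not part of the original procedures.

```python
import numpy as np

def one_hot_matrix(y, r):
    """Return the r x n matrix X whose i-th column is the indicator vector of y[i].

    y : sequence of integers in {1, ..., r} (the observations Y_1, ..., Y_n).
    """
    y = np.asarray(y)
    if y.min() < 1 or y.max() > r:
        raise ValueError("observations must lie in {1, ..., r}")
    X = np.zeros((r, y.size))
    X[y - 1, np.arange(y.size)] = 1.0  # entry (l - 1, i) is 1 iff y[i] == l
    return X

# Hypothetical sample of length n = 8 over r = 4 categories (e.g. A, C, G, T).
y = [1, 1, 3, 3, 3, 2, 4, 4]
X = one_hot_matrix(y, r=4)
```

Each column of `X` contains a single 1, so averaging columns of `X` over a set of indices yields the empirical frequencies of the categories on that set.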

2.2. Notation 

Let $\mathcal{M}(r, n)$ be the set of all real matrices with $r$ rows and $n$
columns. Given an element $t \in \mathcal{M}(r, n)$, we denote by $t^{(l)}$ its
$l$-th row and by $t_i$ its $i$-th column. The space $\mathcal{M}(r, n)$ is
endowed with the inner product defined by

$$\langle t, t' \rangle = \sum_{i=1}^{n} \sum_{l=1}^{r} t_i^{(l)} t_i'^{(l)}.$$

That product is linked with the standard inner products on $\mathbb{R}^r$ and
$\mathbb{R}^n$, denoted respectively by $\langle \cdot, \cdot \rangle_r$ and
$\langle \cdot, \cdot \rangle_n$, by the relations

$$\langle t, t' \rangle = \sum_{i=1}^{n} \langle t_i, t'_i \rangle_r = \sum_{l=1}^{r} \langle t^{(l)}, t'^{(l)} \rangle_n.$$

The norms induced by these products on $\mathcal{M}(r, n)$, $\mathbb{R}^r$ and
$\mathbb{R}^n$ are respectively denoted by $\|\cdot\|$, $\|\cdot\|_r$ and
$\|\cdot\|_n$. Another norm on $\mathcal{M}(r, n)$ appearing in this paper is

$$\|t\|_\infty := \max\{ |t_i^{(l)}| ;\ 1 \le i \le n,\ 1 \le l \le r \}.$$

Let us now define some subsets of $\mathcal{M}(r, n)$ of special interest. The
set composed of the $r \times n$ matrices whose columns are probability
distributions on $\{1, \ldots, r\}$ is denoted by $\mathcal{P}$. Given a
subspace $S$ of $\mathbb{R}^n$, the notation $\mathbb{R}^r \otimes S$ stands for
the linear subspace of $\mathcal{M}(r, n)$ composed of the matrices whose rows
all belong to $S$.

Any vector $u$ in $\mathbb{R}^n$ is identified with the function defined from
$\{1, \ldots, n\}$ into $\mathbb{R}$ and whose value at $i$ is $u_i$, for
$i = 1, \ldots, n$. In particular, for any subset $I$ of $\{1, \ldots, n\}$, we
will call indicator function of $I$, and denote by $\mathbb{1}_I$, the
$\mathbb{R}^n$-vector whose $i$-th coordinate is equal to 1 if $i \in I$, and
null otherwise.

When the distribution of $(Y_1, \ldots, Y_n)$ is given by $s$, we denote
respectively by $\mathbb{P}_s$ and $\mathbb{E}_s$ the underlying probability
distribution on $(\Omega^{\otimes n}, \mathcal{A}^{\otimes n})$ and the
associated expectation.

Last, in the many inequalities we shall encounter, the capital letters
$C, C_1, \ldots$ stand for positive constants, whose value may change from one
line to another. Sometimes, their dependence on one or several parameters will
be indicated. For instance, the notation $C(\alpha, p)$ means that $C$ only
depends on $\alpha$ and $p$.



imsart-ejs ver. 2007/09/18 file: ejs_2008_170.tex date: February 2, 2008 



N. Akakpo/ Detecting change- points in a discrete distribution 






3. Preliminary estimator 

We study in this section a first estimator of the distribution $s$. For
detecting change-points in $s$, it will be used in the next section during a
preliminary stage. We begin here with the definition of that preliminary
estimator: we explain the underlying model selection principle and easily
justify the choice of the involved penalty thanks to [13]. Then, we present the
main result of this paper, about the adaptivity of this estimator. It derives
from an approximation result that will be proved later in the article. Last, we
describe the algorithm used to compute the estimator and give its computational
complexity.



3.1. Definition of the preliminary estimator 

Let $\mathcal{M}$ be the collection of all the partitions of $\{1, \ldots, n\}$
into dyadic intervals. In order to describe it in a more constructive way, let
us introduce the complete binary tree $\mathcal{T}$ with $N + 1$ levels such
that:

• the root of $\mathcal{T}$ is $(0, 0)$;

• for all $j \in \{1, \ldots, N\}$, the nodes at level $j$ are indexed by the
elements of the set $\Lambda(j) = \{(j, k),\ k = 0, \ldots, 2^j - 1\}$;

• for all $j \in \{0, \ldots, N - 1\}$ and all $k \in \{0, \ldots, 2^j - 1\}$,
the left branch that stems from node $(j, k)$ leads to node $(j + 1, 2k)$, and
the right one to node $(j + 1, 2k + 1)$.

The node set of $\mathcal{T}$ is $\mathcal{N} = \bigcup_{j=0}^{N} \Lambda(j)$,
where $\Lambda(0) = \{(0, 0)\}$. The dyadic intervals of $\{1, \ldots, n\}$ are
nothing but the sets

$$I_{(j,k)} = \{k 2^{N-j} + 1, \ldots, (k + 1) 2^{N-j}\}$$

indexed by the elements of $\mathcal{N}$. Hence we deduce a one-to-one
correspondence between the partitions of $\{1, \ldots, n\}$ that belong to
$\mathcal{M}$ and the subsets of $\mathcal{N}$ composed of the leaves of any
complete binary tree resulting from a pruning of $\mathcal{T}$. We consider the
collection of linear spaces of the form $\mathbb{R}^r \otimes S_m$, where
$m \in \mathcal{M}$ and $S_m$ is the linear subspace of $\mathbb{R}^n$ generated
by the indicator functions $\{\mathbb{1}_I,\ I \in m\}$. In the sequel, the term
"model" refers indifferently to such a subspace of $\mathcal{M}(r, n)$ or to the
associated partition in $\mathcal{M}$. For all $m \in \mathcal{M}$, the
least-squares estimator of $s$ in $\mathbb{R}^r \otimes S_m$ is defined by

$$\hat{s}_m = \operatorname{argmin}_{t \in \mathbb{R}^r \otimes S_m} \|X - t\|^2.$$
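
As an illustration of these definitions, the sketch below enumerates the dyadic intervals of $\{1, \ldots, 2^N\}$ and computes the least-squares estimator attached to a given partition, which simply replaces each column of $X$ by the mean of the columns over its interval. The function names and the use of NumPy are assumptions made for this example.

```python
import numpy as np

def dyadic_intervals(N):
    """List the dyadic intervals of {1, ..., 2**N} as 1-based (first, last) pairs."""
    n = 2 ** N
    return [(k * n // 2 ** j + 1, (k + 1) * n // 2 ** j)
            for j in range(N + 1) for k in range(2 ** j)]

def least_squares_on_partition(X, partition):
    """Orthogonal projection of X on R^r (x) S_m for a partition m.

    partition : list of 1-based (first, last) intervals covering {1, ..., n}.
    Each column is replaced by the mean of the columns of X over its interval.
    """
    est = np.empty(X.shape, dtype=float)
    for first, last in partition:
        est[:, first - 1:last] = X[:, first - 1:last].mean(axis=1, keepdims=True)
    return est
```

For $n = 2^N$ the first function returns $2n - 1$ intervals, one per node of the tree.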

Ideally, we would like to choose a model among the collection $\mathcal{M}$
such that the risk of the associated estimator is minimal. However, determining
such a model requires the knowledge of $s$. Therefore the challenge is to define
a procedure $\hat{m}$, based solely on the data, that selects a model for which
the risk of $\hat{s}_{\hat{m}}$ almost reaches the minimal one. In other words,
the estimator $\hat{s}_{\hat{m}}$ should satisfy a so-called oracle inequality

$$\mathbb{E}_s[\|s - \hat{s}_{\hat{m}}\|^2] \le C \inf_{m \in \mathcal{M}} \mathbb{E}_s[\|s - \hat{s}_m\|^2].$$

Besides, as is usually the case, the risk of each estimator $\hat{s}_m$ breaks
down into an approximation error and an estimation error roughly proportional
to the dimension of the model. Indeed, for all $m \in \mathcal{M}$, the
estimator $\hat{s}_m$ satisfies

$$\|s - \bar{s}_m\|^2 + (1 - \|s\|_\infty) D_m \le \mathbb{E}_s[\|s - \hat{s}_m\|^2] \le \|s - \bar{s}_m\|^2 + \Big(1 - \frac{1}{r}\Big) D_m,$$

where $\bar{s}_m$ is the orthogonal projection of $s$ on
$\mathbb{R}^r \otimes S_m$ and $D_m$ is the dimension of $S_m$ (cf. [13], proof
of Corollary 1). Reaching the minimal risk among the estimators of the
collection thus amounts to realizing the best trade-off between the
approximation error and the dimension of the model, which vary in opposite
ways. Therefore, we consider the data-driven procedure

$$\hat{m} = \operatorname{argmin}_{m \in \mathcal{M}} \{\|X - \hat{s}_m\|^2 + \mathrm{pen}(m)\},$$

where $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ is called penalty function.
The preliminary estimator $\tilde{s}$ of $s$ is then defined as

$$\tilde{s} = \hat{s}_{\hat{m}}.$$

Regarding the choice of an adequate penalty, we rely on results proved in [13].
They provide us with the following oracle inequality, up to a quantity
depending on $\|s\|_\infty$, which justifies the choice of a penalty simply
linear in the dimension of the models.

Proposition 1. Let $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ be a penalty of
the form

$$\mathrm{pen}(m) = c_0 D_m,$$

where, for $m \in \mathcal{M}$, $D_m$ is the dimension of $S_m$. If $c_0$ is
positive and large enough and if $\|s\|_\infty < 1$, then

$$\mathbb{E}_s[\|s - \tilde{s}\|^2] \le C(c_0)(1 - \|s\|_\infty)^{-1} \inf_{m \in \mathcal{M}} \mathbb{E}_s[\|s - \hat{s}_m\|^2]. \qquad (3.1)$$

Proof. Let us introduce the subcollections of models of same dimension

$$\mathcal{M}_D = \{m \in \mathcal{M} \text{ s.t. } D_m = D\}, \quad \text{for } 1 \le D \le n.$$

We look for a penalty satisfying the hypotheses of Corollary 1 in [13], in
other words of the form

$$\mathrm{pen}(m) = (k_1 + k_2 L(D_m)) D_m,$$

where $k_1$ and $k_2$ are positive constants, and $\{L(D)\}_{1 \le D \le n}$ is
a family of positive numbers, called weights, such that

$$\sum_{D=1}^{n} |\mathcal{M}_D| \exp(-D L(D)) \le 1.$$

In fact, it is enough to require that

$$L(D) \ge (\ln |\mathcal{M}_D|)/D + \ln 2, \quad \text{for all } 1 \le D \le n.$$

Since the cardinal of $\mathcal{M}_D$ is equal to the number of complete binary
trees with $D$ leaves resulting from a pruning of $\mathcal{T}$, it is at most
the Catalan number $D^{-1}\binom{2(D-1)}{D-1}$, and thus upper-bounded by
$4^D$. Consequently, we can set all the weights equal to a same constant.
Inequality (3.1) then follows from the proof of Corollary 1 in [13]. □
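
The counting argument can be checked numerically on small trees. The sketch below, with hypothetical function names, counts the partitions of $\{1, \ldots, 2^N\}$ into exactly $D$ dyadic intervals by recursion on the two halves, and compares the result with the Catalan number and with $4^D$. Because the tree has finite depth, the count may be smaller than the Catalan number, which does not affect the bound.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dyadic_partitions(N, D):
    """Number of partitions of {1, ..., 2**N} into exactly D dyadic intervals."""
    if D == 1:
        return 1   # the root is kept as the only leaf
    if N == 0:
        return 0   # a single point cannot be split further
    # Split the root: d pieces in the left half, D - d pieces in the right half.
    return sum(count_dyadic_partitions(N - 1, d) * count_dyadic_partitions(N - 1, D - d)
               for d in range(1, D))

def catalan(k):
    """k-th Catalan number, equal to binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

# Counts for n = 8 (N = 3) and D = 1, ..., 8.
counts = [count_dyadic_partitions(3, D) for D in range(1, 9)]
```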

From now on, we will always assume that the preliminary estimator derives from
a penalty of the form $\mathrm{pen}(m) = c_0 D_m$, where the constant $c_0$ is
positive and large enough so as to yield an oracle-type inequality. By way of
comparison, let us mention that the similar procedure based on the exhaustive
collection of partitions of $\{1, \ldots, n\}$ only satisfies an oracle-type
inequality such as (3.1) within a $\ln(n)$ factor, owing to the greater number
of models per dimension for that collection (cf. [13], Proposition 1).

Last, notice that $\tilde{s}$ does not necessarily belong to $\mathcal{P}$.
Nevertheless, since the vector $(1 \ldots 1)$ belongs to any $S_m$, for
$m \in \mathcal{M}$, the elements in a same column of $\tilde{s}$ sum up to 1.
In order to get an estimator of $s$ with values in $\mathcal{P}$, we can
consider the orthogonal projection of $\tilde{s}$ on the closed convex set
$\mathcal{P}$, whose risk is even smaller than that of $\tilde{s}$.



3.2. Adaptivity of the preliminary estimator 

Though the oracle-type inequality (3.1) ensures that, under a minor constraint
on $s$, the estimator $\tilde{s}$ is almost as good as the best estimator in the
collection $\{\hat{s}_m\}_{m \in \mathcal{M}}$, it does not provide any
comparison of $\tilde{s}$ to other estimators of $s$. Therefore, we now pursue
the study of $\tilde{s}$ adopting a minimax point of view. We consider a large
family of subsets of $\mathcal{P}$, to be defined in the next paragraph. Let us
denote by $\mathcal{S}$ some subset in that family. Our aim is to compare the
maximal risk of $\tilde{s}$ when $s$ belongs to $\mathcal{S}$ to the minimax
risk over $\mathcal{S}$. From Theorem 1 in [13], it easily follows that an
upper-bound for the risk of $\tilde{s}$ is

$$\mathbb{E}_s[\|s - \tilde{s}\|^2] \le C(c_0) \inf_{1 \le D \le n} \Big\{ \inf_{m \in \mathcal{M}_D} \|s - \bar{s}_m\|^2 + D \Big\}, \qquad (3.2)$$

where we recall that
$\mathcal{M}_D = \{m \in \mathcal{M} \text{ s.t. } \dim(S_m) = D\}$ and
$\bar{s}_m$ is the orthogonal projection of $s$ on $\mathbb{R}^r \otimes S_m$.
Consequently, the approximation qualities of our family of models with respect
to each subset $\mathcal{S}$ remain to be evaluated. More precisely, for each
subset $\mathcal{S}$ and each dimension $D$, we shall provide upper-bounds for
the approximation error $\inf_{m \in \mathcal{M}_D} \|s - \bar{s}_m\|^2$ when
$s \in \mathcal{S}$.



As in [13], we consider subsets of $\mathcal{P}$ whose definition is inspired by
the characterization in terms of wavelet coefficients of balls in Besov spaces.
In order to define them, we equip $\mathbb{R}^n$ with an orthonormal wavelet
basis: the Haar basis.

Definition 1. Let $\Lambda = \bigcup_{j=-1}^{N-1} \Lambda(j)$, where
$\Lambda(-1) = \{(-1, 0)\}$ and

$$\Lambda(j) = \{(j, k),\ k = 0, \ldots, 2^j - 1\}$$

for $0 \le j \le N - 1$. Let $\varphi : \mathbb{R} \to \{-1, 1\}$ be the
function with support $(0, 1]$ that takes value 1 on $(0, 1/2]$ and $-1$ on
$(1/2, 1]$. If $\lambda = (-1, 0)$, $\phi_\lambda$ is the vector in
$\mathbb{R}^n$ whose coordinates are all equal to $1/\sqrt{n}$. If
$\lambda = (j, k)$, where $j \ne -1$ and $(j, k) \in \Lambda(j)$,
$\phi_\lambda$ is the vector in $\mathbb{R}^n$ whose $i$-th coordinate is

$$\phi_\lambda(i) = \frac{2^{j/2}}{\sqrt{n}} \, \varphi\Big(2^j \frac{i}{n} - k\Big), \quad \text{for } i = 1, \ldots, n.$$

The functions $\{\phi_\lambda\}_{\lambda \in \Lambda}$ are called the Haar
functions. They form an orthonormal basis of $\mathbb{R}^n$ called the Haar
basis.

This basis is closely linked with the collection of partitions $\mathcal{M}$:
the Haar functions from a same resolution level $j$, $0 \le j \le N - 1$, are
indexed by the nodes at level $j$ in the tree $\mathcal{T}$ (cf. Section 3.1),
which give the supports of these wavelets. Besides, any element
$t \in \mathcal{M}(r, n)$ can be decomposed into

$$t = \sum_{j=-1}^{N-1} \sum_{\lambda \in \Lambda(j)} \beta_\lambda \phi_\lambda,$$

where, for all $\lambda \in \Lambda$, $\beta_\lambda$ is the column-vector in
$\mathbb{R}^r$ whose $l$-th coefficient is
$\beta_\lambda^{(l)} = \langle t^{(l)}, \phi_\lambda \rangle_n$, for
$l = 1, \ldots, r$. So, we improperly refer to the $\beta_\lambda$'s as the
wavelet coefficients of $t$. We then define Besov bodies as follows.
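
The Haar basis of Definition 1 and the decomposition above can be verified numerically. The following sketch, whose function names and NumPy implementation are assumptions made for illustration, builds the $n$ Haar vectors as the rows of a matrix and computes the wavelet coefficients of a matrix $t$ row by row.

```python
import numpy as np

def haar_basis(N):
    """Return the n x n matrix (n = 2**N) whose rows are the Haar vectors phi_lambda."""
    n = 2 ** N
    i = np.arange(1, n + 1)
    rows = [np.full(n, 1.0 / np.sqrt(n))]            # lambda = (-1, 0)
    for j in range(N):
        for k in range(2 ** j):
            u = 2.0 ** j * i / n - k                  # argument of the mother function
            mother = np.where((u > 0) & (u <= 0.5), 1.0,
                              np.where((u > 0.5) & (u <= 1.0), -1.0, 0.0))
            rows.append(2.0 ** (j / 2) / np.sqrt(n) * mother)
    return np.array(rows)

def wavelet_coefficients(t, basis):
    """Array beta with beta[l, :] = inner products of row l of t with each Haar vector."""
    return t @ basis.T

B = haar_basis(3)
t = np.random.default_rng(0).random((2, 8))
beta = wavelet_coefficients(t, B)
```

Multiplying the coefficients back by the basis, `beta @ B`, recovers `t`, which is the decomposition stated above written in matrix form.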

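Definition 2. Let $\alpha > 0$, $p > 0$ and $R > 0$. The set composed of all
the elements $t \in \mathcal{M}(r, n)$ such that

$$\sum_{j=0}^{N-1} 2^{jp(\alpha + 1/2 - 1/p)} \sum_{\lambda \in \Lambda(j)} [\ldots]$$

where, for $l = 1, \ldots, r$,
$\beta_\lambda^{(l)} = \langle t^{(l)}, \phi_\lambda \rangle_n$, is denoted by
$\mathcal{B}(\alpha, p, R)$ and called a Besov body. The set of all the elements
of $\mathcal{P}$ that belong to $\mathcal{B}(\alpha, p, R)$ is denoted by
$\mathcal{P}(\alpha, p, R)$.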

In particular, for an element of a Besov body, the wavelet coefficients from a
same resolution level $j$ are all the smaller as $j$ is high. For a wide range
of values of the parameter $(\alpha, p, R)$, we are able to bound the
approximation errors appearing in (3.2) uniformly over
$\mathcal{P}(\alpha, p, R)$.

Theorem 1. Let $p \in (0, 2]$, $\alpha > 1/p - 1/2$ and $R > 0$. For all
$D \in \{1, \ldots, n\}$,

$$\sup_{s \in \mathcal{P}(\alpha, p, R)} \inf_{m \in \mathcal{M}_D} \|s - \bar{s}_m\|^2 \le C(\alpha, p) \, n R^2 D^{-2\alpha}.$$

That result will be proved in Section 6.

Let us now come back to our initial problem, that is comparing the performance
of $\tilde{s}$ to that of any other estimator of $s$. For $\alpha > 0$, $p > 0$
and $R > 0$, the minimax risk over $\mathcal{P}(\alpha, p, R)$ is given by

$$\mathcal{R}(\alpha, p, R) = \inf_{\hat{s}} \sup_{s \in \mathcal{P}(\alpha, p, R)} \mathbb{E}_s[\|s - \hat{s}\|^2],$$

where the infimum is taken over all the estimators $\hat{s}$ of $s$. Thanks to
the above approximation result, we obtain, as stated below, that, for a whole
range of values of $(\alpha, p, R)$, the estimator $\tilde{s}$ reaches the
minimax risk over $\mathcal{P}(\alpha, p, R)$ within a multiplicative constant.
In other words, $\tilde{s}$ is adaptive in the minimax sense over that range of
subsets of $\mathcal{P}$.

Theorem 2. For all $p \in (0, 2]$, $\alpha > 1/p - 1/2$ and
$n^{-1/2} \le R < n^{\alpha}$,

$$\sup_{s \in \mathcal{P}(\alpha, p, R)} \mathbb{E}_s[\|s - \tilde{s}\|^2] \le C(c_0, \alpha, p) \, \mathcal{R}(\alpha, p, R). \qquad (3.3)$$

Proof. Let us fix $p \in (0, 2]$, $\alpha > 1/p - 1/2$ and
$n^{-1/2} \le R < n^{\alpha}$. Combining Inequality (3.2) and Theorem 1 leads to

$$\sup_{s \in \mathcal{P}(\alpha, p, R)} \mathbb{E}_s[\|s - \tilde{s}\|^2] \le C(c_0, \alpha, p) \inf_{1 \le D \le n} \{n R^2 D^{-2\alpha} + D\}.$$

In order to realize approximately the best trade-off between the terms
$n R^2 D^{-2\alpha}$ and $D$, which vary in opposite ways when $D$ increases, we
choose $D$ as large as possible under the constraint
$D \le n R^2 D^{-2\alpha}$. Let us denote by $D^\star$ the largest integer $D$
such that $D \le (n R^2)^{1/(1+2\alpha)}$. One can easily check that, given the
hypotheses linking $n$ and $R$, $D^\star$ does belong to $\{1, \ldots, n\}$ and
provides the upper-bound

$$\sup_{s \in \mathcal{P}(\alpha, p, R)} \mathbb{E}_s[\|s - \tilde{s}\|^2] \le C(c_0, \alpha, p) (n R^2)^{1/(2\alpha+1)}.$$

The matching lower bound for the minimax risk over $\mathcal{P}(\alpha, p, R)$
has been proved in [13] (Theorem 3). □
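
The trade-off used in this proof can be examined numerically. The sketch below, with a hypothetical function name and arbitrary parameter values, computes $D^\star$, the value of $n R^2 D^{-2\alpha} + D$ at $D^\star$, and the exact minimum of that quantity over $D \in \{1, \ldots, n\}$ by brute force. The clamping of $D^\star$ to $\{1, \ldots, n\}$ is a safeguard added here; under the hypotheses of Theorem 2 it has no effect.

```python
import math

def tradeoff(n, R, alpha):
    """Return (D_star, value at D_star, brute-force minimum, x).

    x = (n R^2)^(1 / (1 + 2 alpha)) is the order of magnitude of the bound.
    """
    x = (n * R ** 2) ** (1.0 / (1.0 + 2.0 * alpha))
    d_star = max(1, min(n, math.floor(x)))           # safeguard, see lead-in
    value = lambda D: n * R ** 2 * D ** (-2.0 * alpha) + D
    best = min(value(D) for D in range(1, n + 1))    # exact minimum over integers
    return d_star, value(d_star), best, x

d_star, at_d_star, best, x = tradeoff(n=1024, R=1.0, alpha=1.0)
```

In this example both the value at $D^\star$ and the exact minimum lie between $x$ and $(2^{2\alpha} + 1)x$, so choosing $D^\star$ loses at most a constant factor depending on $\alpha$.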



3.3. Computing the preliminary estimator 

Since the penalty only depends on the dimension of the models, we will also
denote by $\mathrm{pen}(D)$ the penalty assigned to all models in
$\mathcal{M}_D$, for $1 \le D \le n$. A way to compute $\tilde{s}$ could rely on
the equality

$$\min_{m \in \mathcal{M}} \{\|X - \hat{s}_m\|^2 + \mathrm{pen}(m)\} = \min_{1 \le D \le n} \Big\{ \min_{m \in \mathcal{M}_D} \|X - \hat{s}_m\|^2 + \mathrm{pen}(D) \Big\}.$$

That would lead us to compute a best estimator for each dimension, before
choosing one among them by taking into account the penalty term, as in [17] for
the exhaustive collection of partitions or in [?]. But, even when using
Bellman's algorithm, that requires a time polynomial in $n$ rather than linear.
Here, we shall see that we can avoid such a computationally intensive way by
taking advantage of the form of the penalty.

Let us express more explicitly the criterion to be minimized by $\hat{m}$. For
$m \in \mathcal{M}$, we denote by $\{i_k, \ldots, i_{k+1} - 1\}$,
$1 \le k \le D_m$, the dyadic intervals composing that partition, where
$1 = i_1 < i_2 < \ldots < i_{D_m} < i_{D_m + 1} = n + 1$. For all
$1 \le k \le D_m$, any column of $\hat{s}_m$ whose index belongs to
$\{i_k, \ldots, i_{k+1} - 1\}$ is equal to the mean $\bar{X}(i_k : i_{k+1})$ of
the columns of $X$ whose indices belong to the interval
$\{i_k, \ldots, i_{k+1} - 1\}$. Owing to the form of the penalty, and to the
additivity of the least-squares criterion, the whole criterion to minimize
breaks down into a sum:

$$\|X - \hat{s}_m\|^2 + \mathrm{pen}(m) = \sum_{k=1}^{D_m} \mathcal{L}(i_k, i_{k+1}), \qquad (3.4)$$

where, for all $1 \le k \le D_m$,

$$\mathcal{L}(i_k, i_{k+1}) = c_0 + \sum_{i=i_k}^{i_{k+1}-1} \|X_i - \bar{X}(i_k : i_{k+1})\|_r^2.$$

By comparison with the method suggested in the previous paragraph, we are left
with only one minimization problem, with no dimension constraint, instead of
$n$.

We now turn to graph theory, where our minimization problem finds a natural
interpretation. We consider the weighted directed graph $G$ having
$\{1, \ldots, n + 1\}$ as vertex set and whose edges are the pairs $(i, j)$ such
that $\{i, \ldots, j - 1\}$ is a dyadic interval of $\{1, \ldots, n\}$, assigned
with the weight $\mathcal{L}(i, j)$. A little vocabulary will be helpful. We say
that a vertex $j$ is a successor to a vertex $i$ if $(i, j)$ is an edge of the
graph $G$, and we associate to each vertex $i$ its successor list $\Gamma_i$.
For all $1 \le D \le n$, a $(D + 1)$-tuple $(i_1, \ldots, i_{D+1})$ of vertices
of $G$ such that $i_1 = 1$, $i_{D+1} = n + 1$ and each vertex is a successor to
the previous one will be called a path leading from 1 to $n + 1$ in $D$ steps.
The length of such a path is defined as
$\sum_{k=1}^{D} \mathcal{L}(i_k, i_{k+1})$. Determining $\hat{m}$ thus amounts
to finding a shortest path leading from 1 to $n + 1$ in the graph $G$. That
problem can be solved by using one of the simplest shortest-path algorithms, the
one dedicated to acyclic directed graphs, presented in [?] (Section 24.2) for
instance. For the sake of completeness, we also describe it in Table 3.1. We
have to underline that there are only $2n - 1$ dyadic intervals of
$\{1, \ldots, n\}$. Therefore, the graph $G$, with $n + 1$ vertices and
$2n - 1$ edges, can be represented by only $O(n)$ data: the weights
$\mathcal{L}(i, j)$, for $1 \le i \le n$ and $j \in \Gamma_i$, and the successor
lists $\Gamma_i$, for $1 \le i \le n$. In the key step of the algorithm, i.e.
step 2, each edge is only considered once. When the time comes to consider the
edges with origin $i$, the variables $d(i)$ and $p(i)$ respectively contain the
length of a shortest path from 1 to $i$ and a predecessor of $i$ in such a path.
Just before the edge $(i, j)$, where $j \in \Gamma_i$, is processed, the
variables $d(j)$ and $p(j)$ contain respectively the length of a shortest path
leading from 1 to $j$ and a predecessor of $j$ in such a path, based solely on
the edges that have already been encountered. Then dealing with the edge
$(i, j)$ consists in testing whether the length of the path leading from 1 to
$j$ can be shortened by going via $i$ and updating, if necessary, $d(j)$ and
$p(j)$. What clearly appears from the above description of the algorithm is that
its complexity is only linear in the size $n$ of the data.
Table 3.1
Algorithm for computing $\tilde{s}$

Step 1: Initialization
    Set $d(1) = 0$ and $p(1) = +\infty$.
    For $i = 2, \ldots, n + 1$,
        set $d(i) = +\infty$ and $p(i) = +\infty$.

Step 2: Determining the lengths of the shortest paths with origin 1
    For $i = 1, \ldots, n$,
        for $j \in \Gamma_i$,
            if $d(j) > d(i) + \mathcal{L}(i, j)$,
                then do $d(j) \leftarrow d(i) + \mathcal{L}(i, j)$ and $p(j) \leftarrow i$.

Step 3: Determining a shortest path $P$ from 1 to $n + 1$
    Set pred $= p(n + 1)$ and $P = (n + 1)$.
    While pred $\ne +\infty$,
        replace $P$ with the concatenation of pred followed by $P$,
        do pred $\leftarrow p(\text{pred})$.

Step 4: Computing the preliminary estimator
    Set $D = \text{length}(P) - 1$.
    For $k = 1, \ldots, D$,
        for $i = P(k), \ldots, P(k + 1) - 1$,
            set $\tilde{s}_i = \bar{X}(P(k) : P(k + 1))$.
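
The steps of Table 3.1 can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the author's code: the function name `preliminary_estimator` is hypothetical, vertices are numbered from 0 to $n$ instead of 1 to $n + 1$, the predecessor sentinel is $-1$ instead of $+\infty$, and the edge weights are evaluated on the fly from prefix sums rather than stored.

```python
import numpy as np

def preliminary_estimator(X, c0):
    """Minimize ||X - s_m||^2 + c0 * D_m over partitions m into dyadic intervals.

    X : (r, n) array with n a power of 2.
    Returns the 0-based cut positions (from 0 to n) and the fitted (r, n) matrix.
    """
    r, n = X.shape
    if n & (n - 1):
        raise ValueError("n must be a power of 2")
    # Prefix sums give the weight of any edge in O(r) operations.
    csum = np.concatenate([np.zeros((r, 1)), np.cumsum(X, axis=1)], axis=1)
    csq = np.concatenate([[0.0], np.cumsum((X ** 2).sum(axis=0))])

    def weight(i, j):
        """c0 plus the residual sum of squares on columns i, ..., j - 1."""
        tot = csum[:, j] - csum[:, i]
        return c0 + (csq[j] - csq[i]) - tot @ tot / (j - i)

    d = np.full(n + 1, np.inf)   # d[v]: length of a shortest known path from 0 to v
    p = np.full(n + 1, -1)       # p[v]: predecessor of v on such a path
    d[0] = 0.0
    for i in range(n):           # vertices in increasing (topological) order
        length = 1
        # Successors of i: i + 2^a for every power of two 2^a dividing i.
        while length <= n - i and i % length == 0:
            j = i + length
            candidate = d[i] + weight(i, j)
            if candidate < d[j]:
                d[j], p[j] = candidate, i
            length *= 2
    cuts = [n]                   # backtrack from n to 0
    while cuts[-1] != 0:
        cuts.append(int(p[cuts[-1]]))
    cuts.reverse()
    est = np.empty((r, n))
    for a, b in zip(cuts[:-1], cuts[1:]):
        est[:, a:b] = X[:, a:b].mean(axis=1, keepdims=True)
    return cuts, est
```

The double loop visits each of the $2n - 1$ edges once, so the number of weight evaluations is linear in $n$, in line with the complexity stated above.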



4. Hybrid estimator 

Let us give a first glimpse of what can be expected from the preliminary
estimator for detecting change-points in the distribution $s$. In Figure 4.1,
we plot the first line of a distribution $s_a \in \mathcal{M}(2, 1024)$ that is
piecewise constant over a partition with only 3 segments, together with the
first line of a realization of $\tilde{s}_a$. The value of $c_0$ has been chosen
so as to minimize the distance between $s_a$ and its estimator. Both
change-points in $s_a$ are indeed detected. But this example also shows that the
selected partition, due to its special nature, is highly likely to contain some
segments whose endpoints do not correspond to any significant rupture in $s$. In
order to get rid of those, we propose a two-stage procedure, which we name
hybrid procedure. After describing it, we provide an adaptivity result for that
procedure and end this section with computational issues.

Fig 4.1. First lines of the distribution $s_a$ (full line) and of its
preliminary estimator $\tilde{s}_a$ (dashed line), as functions of $i$,
$1 \le i \le 1024$.

In the sequel, we suppose that $n > 2$. In order to implement the hybrid
procedure, we need to work with the set $\mathcal{M}(r, n/2)$ of
$r \times (n/2)$ real matrices. That requires a series of notations, very close
indeed to those encountered up to now. For all $t \in \mathcal{M}(r, n)$, we
denote by $t^{\bullet}$ (resp. $t^{\circ}$) the element of
$\mathcal{M}(r, n/2)$ composed of the columns of $t$ whose indices are even
(resp. odd). We equip $\mathcal{M}(r, n/2)$ with the norm analogous to the norm
$\|\cdot\|$ on $\mathcal{M}(r, n)$. For the sake of simplicity, we will also
denote by $\|\cdot\|$ that norm on $\mathcal{M}(r, n/2)$. For a partition $m$ of
$\{1, \ldots, n/2\}$, we denote by $S'_m$ the linear subspace of
$\mathbb{R}^{n/2}$ generated by the indicator functions of the intervals
$I \in m$ and by $D'_m$ its dimension. These notations being settled, we are now
able to define the hybrid estimator of $s$. First, we compute the preliminary
estimator of $s^{\bullet}$ based on $X^{\bullet}$, that is
$\tilde{s}^{\bullet}$, and we thus get a random partition of
$\{1, \ldots, n/2\}$ into dyadic intervals denoted by $\hat{m}^{\bullet}$. Then,
we consider the random collection $\hat{\mathcal{M}}^{\bullet}$ of all the
partitions of $\{1, \ldots, n/2\}$ that are built on $\hat{m}^{\bullet}$. For
each partition $m$ of $\{1, \ldots, n/2\}$ into intervals, the least-squares
estimator of $s^{\circ}$ in $\mathbb{R}^r \otimes S'_m$ is defined by

$$\hat{s}^{\circ}_m = \operatorname{argmin}_{t \in \mathbb{R}^r \otimes S'_m} \|X^{\circ} - t\|^2.$$

We select

$$\hat{m}^{\circ} = \operatorname{argmin}_{m \in \hat{\mathcal{M}}^{\bullet}} \{\|X^{\circ} - \hat{s}^{\circ}_m\|^2 + \mathrm{pen}^{\circ}(m)\},$$

where the penalty $\mathrm{pen}^{\circ}$ will be chosen in the next paragraph.
We define the penalized estimator of $s^{\circ}$ based on the collection
$\hat{\mathcal{M}}^{\bullet}$ as $\hat{s}^{\circ}_{\hat{m}^{\circ}}$. Last, we
define the hybrid estimator $\tilde{s}_{hyb}$ of $s$ as the random matrix in
$\mathcal{M}(r, n)$ whose submatrices composed respectively of columns with even
indices and of columns with odd indices are both equal to
$\hat{s}^{\circ}_{\hat{m}^{\circ}}$.

Let us study $\tilde{s}_{hyb}$ from a theoretical point of view. Under a mild
assumption on $s$, we derive from the results proved in the previous section the
following adaptivity property for $\tilde{s}_{hyb}$.

Theorem 3. Let $\hat{D}$ be the cardinal of $\hat{m}^{\bullet}$ and
$\mathrm{pen}^{\circ} : \hat{\mathcal{M}}^{\bullet} \to \mathbb{R}_+$ be a
penalty of the form

$$\mathrm{pen}^{\circ}(m) = \big(c_1 + c_2 \ln(\hat{D}/D'_m)\big) D'_m, \qquad (4.5)$$

where $c_1$ and $c_2$ are positive. If
$\|s^{\bullet} - s^{\circ}\|^2 \le C \ln(n)$ and $c_0$, $c_1$ and $c_2$ are
large enough, then, for all $p \in (0, 2]$, $\alpha > 1/p - 1/2$ and $R$ such
that $n^{-1/2} \le R < n^{\alpha}$,

$$\sup_{s \in \mathcal{P}(\alpha, p, R)} \mathbb{E}_s[\|s - \tilde{s}_{hyb}\|^2] \le C \ln(n) \, \mathcal{R}(\alpha, p, R), \qquad (4.6)$$

where $C$ only depends on $c_0$, $c_1$, $c_2$, $\alpha$ and $p$.

Thus, with Inequality (4.6), we recover a result similar to Inequality (3.3), up
to a logarithmic factor.
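
The second stage of the hybrid procedure can be sketched as follows. The code is an illustration under stated assumptions: the first-stage partition is supplied by the caller as a list `bounds` of 0-based cut positions, the function names are hypothetical, and the minimization over coarser partitions is carried out by an exact dynamic program over the number of pieces, with the penalty of the form (4.5). It does not reproduce the heuristic implementation mentioned in the abstract.

```python
import numpy as np

def second_stage(X_odd, bounds, c1, c2, d_max=None):
    """Select a partition built on a first-stage partition by penalized least squares.

    X_odd  : (r, h) array, the columns of X with odd (1-based) indices.
    bounds : increasing 0-based cut positions [0, ..., h] of the first-stage partition.
    """
    r, h = X_odd.shape
    d_hat = len(bounds) - 1
    d_max = d_hat if d_max is None else min(d_max, d_hat)
    csum = np.concatenate([np.zeros((r, 1)), np.cumsum(X_odd, axis=1)], axis=1)
    csq = np.concatenate([[0.0], np.cumsum((X_odd ** 2).sum(axis=0))])

    def rss(a, b):
        """Residual sum of squares when columns a, ..., b - 1 share one mean."""
        tot = csum[:, b] - csum[:, a]
        return (csq[b] - csq[a]) - tot @ tot / (b - a)

    # best[d, j]: minimal residual for the first j first-stage segments in d pieces.
    best = np.full((d_max + 1, d_hat + 1), np.inf)
    arg = np.zeros((d_max + 1, d_hat + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, d_max + 1):
        for j in range(d, d_hat + 1):
            for i in range(d - 1, j):
                cand = best[d - 1, i] + rss(bounds[i], bounds[j])
                if cand < best[d, j]:
                    best[d, j], arg[d, j] = cand, i
    # Penalized criterion with pen(D) = (c1 + c2 * ln(d_hat / D)) * D.
    crit = [best[d, d_hat] + (c1 + c2 * np.log(d_hat / d)) * d
            for d in range(1, d_max + 1)]
    d_sel = 1 + int(np.argmin(crit))
    idx, j = [d_hat], d_hat
    for d in range(d_sel, 0, -1):
        j = int(arg[d, j])
        idx.append(j)
    cuts = [bounds[k] for k in reversed(idx)]
    est = np.empty((r, h))
    for a, b in zip(cuts[:-1], cuts[1:]):
        est[:, a:b] = X_odd[:, a:b].mean(axis=1, keepdims=True)
    return cuts, est

def hybrid(X, bounds, c1, c2):
    """Hybrid estimate: even and odd columns both receive the second-stage fit."""
    cuts, est = second_stage(X[:, 0::2], bounds, c1, c2)  # 0-based 0::2 = odd 1-based
    return cuts, np.repeat(est, 2, axis=1)
```

The triple loop performs of the order of $\hat{D}^2 D_{\max}$ evaluations of the residual, where $\hat{D}$ is the number of first-stage segments.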

Proof. For all $1 \le D \le \hat{D}$, the number $N_D$ of partitions in
$\hat{\mathcal{M}}^{\bullet}$ with $D$ pieces satisfies

$$N_D = \binom{\hat{D} - 1}{D - 1} \le \Big(\frac{e \hat{D}}{D}\Big)^{D}.$$

The above inequality results from a property of binomial coefficients that may
be found in [18] (Proposition 2.5) for instance. So the weights defined by

$$L(D) = \ln(2e) + \ln(\hat{D}/D), \quad \text{for } 1 \le D \le \hat{D},$$

are such that

$$\sum_{D=1}^{\hat{D}} N_D \exp(-D L(D)) \le 1.$$

Moreover, the penalty peh° given by (14. 5|) fulfills the hypotheses of Theorem 1 
in (|131 ) provided ci and c 2 are large enough. With a slight abuse of notation, for 
any partition m of {1, ... , w/2}, we still denote by i m the orthogonal projection 
of an element t S ^#(r, n/2) on M r ® S^. Working conditionally to X', the 
collection .M* is deterministic, so we deduce from Theorem 1 of ffl3h applied to 
the estimator s°m° of s° that 

E s o[|| s °-r° A =|| 2 |X'] <C(c 1 , C2 )[|| S -^ rfl .|| 2 + f5eh°(m')]. (4.7) 

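We recall that $\tilde s^\bullet = \hat s^\bullet_{\hat m^\bullet}$. So, thanks to the triangle inequality, and since an
orthogonal projection is a shrinking map, we get

\[ \|s^\circ - \bar s^\circ_{\hat m^\bullet}\|^2 \le C\bigl(\|s^\circ - s^\bullet\|^2 + \|s^\bullet - \tilde s^\bullet\|^2\bigr). \]

Besides, for all $m \in \hat{\mathcal M}^\bullet$,

\[ \mathrm{pen}^\circ(m) \le C(c_1,c_2) \ln(n) D^\bullet_{\hat m^\bullet}. \]

Taking into account the last two inequalities and integrating with respect to $X^\bullet$
then leads from (4.7) to

\[ \mathbb E_s\bigl[\|s^\circ - \tilde s^\circ_{\hat m^\circ}\|^2\bigr] \le C(c_1,c_2)\bigl[\|s^\circ - s^\bullet\|^2 + \mathbb E_{s^\bullet}[\|s^\bullet - \tilde s^\bullet\|^2] + \ln(n)\,\mathbb E_{s^\bullet}(D^\bullet_{\hat m^\bullet})\bigr], \]

where $D^\bullet_{\hat m^\bullet}$ is nothing but $\hat D$. Besides, it follows from the definition of $\tilde s_{hyb}$ that

\[ \|s - \tilde s_{hyb}\|^2 = \|s^\bullet - \tilde s^\circ_{\hat m^\circ}\|^2 + \|s^\circ - \tilde s^\circ_{\hat m^\circ}\|^2. \]

Applying the triangle inequality, we then get

\[ \|s - \tilde s_{hyb}\|^2 \le C\bigl(\|s^\bullet - s^\circ\|^2 + \|s^\circ - \tilde s^\circ_{\hat m^\circ}\|^2\bigr). \]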
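Consequently,

\[ \mathbb E_s\bigl[\|s - \tilde s_{hyb}\|^2\bigr] \le C(c_1,c_2)\bigl[\|s^\circ - s^\bullet\|^2 + \mathbb E_{s^\bullet}[\|s^\bullet - \tilde s^\bullet\|^2] + \ln(n)\,\mathbb E_{s^\bullet}(\hat D)\bigr]. \tag{4.8} \]

Let us denote by $\mathcal M^\bullet$ the set of all partitions of $\{1,\dots,n/2\}$ into dyadic intervals.
For the risk of $\tilde s^\bullet$, Theorem 1 of [13] provides

\[ \mathbb E_{s^\bullet}\bigl[\|s^\bullet - \tilde s^\bullet\|^2\bigr] \le C(c_0) \inf_{m \in \mathcal M^\bullet} \bigl\{\|s^\bullet - \bar s^\bullet_m\|^2 + D^\bullet_m\bigr\}. \tag{4.9} \]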

In order to bound the term $\mathbb E_{s^\bullet}(\hat D)$, we need to go back to the proof of Theorem
1 in [13] (Section 8.1). As already seen during the proof of Proposition 1, we
can choose a positive constant $L$ such that $\sum_{m \in \mathcal M^\bullet} \exp(-L D^\bullet_m) \le 1$. Let us fix
a partition $m \in \mathcal M^\bullet$ and $\xi > 0$. Using the same notation as in [13], we deduce
from the proof of Theorem 1 in [13] that there exists an event $\Omega_\xi(m)$ such that
$P_{s^\bullet}(\Omega_\xi(m)) \ge 1 - \exp(-\xi)$ and on which

c D < Cx\\s m - iV|| 2 + C 2 (c )D' m + C 3 D + 
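Therefore, if $c_0 > C_3$, then

\[ \hat D \le C(c_0)\bigl(\|s^\bullet - \bar s^\bullet_m\|^2 + D^\bullet_m + \xi\bigr). \]

Integrating this inequality and taking the infimum over $m \in \mathcal M^\bullet$ then yields

\[ \mathbb E_{s^\bullet}(\hat D) \le C(c_0) \inf_{m \in \mathcal M^\bullet} \bigl\{\|s^\bullet - \bar s^\bullet_m\|^2 + D^\bullet_m\bigr\}. \tag{4.10} \]

Moreover, one can check that

\[ \inf_{m \in \mathcal M^\bullet} \bigl\{\|s^\bullet - \bar s^\bullet_m\|^2 + D^\bullet_m\bigr\} \le \inf_{m \in \mathcal M} \bigl\{\|s - \bar s_m\|^2 + D_m\bigr\}. \tag{4.11} \]

Combining Inequalities (4.8) to (4.11) and the assumption on $\|s^\bullet - s^\circ\|^2$, we
finally get

\[ \mathbb E_s\bigl[\|s - \tilde s_{hyb}\|^2\bigr] \le C(c_0,c_1,c_2) \ln(n) \inf_{m \in \mathcal M} \bigl\{\|s - \bar s_m\|^2 + D_m\bigr\}. \]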
We then conclude the proof as that of Theorem O □ 



Regarding the computation of $\tilde s_{hyb}$, we know from Section 3.3 that determining
$\tilde s^\bullet$ only requires $O(n)$ computations. On the other hand, since $\mathrm{pen}^\circ$ is
not linear in the dimension of the models, $\hat m^\circ$ has to be determined following
the method suggested at the beginning of Section 3.3 and using Bellman's
algorithm. If we impose an upper bound $D_{max}$ on the dimension of the model
selected during the second stage, determining $\hat m^\circ$ given $X^\bullet$ then requires of the
order of $\hat D^2 D_{max}$ computations. Since $\hat D$ is upper-bounded by $n/2$, we can only
ensure that the computational complexity of $\tilde s_{hyb}$ is, in the worst case, of the
order of $n^2 D_{max}$. However, we will see in Section 5 that, in practice, the hybrid
procedure can also be implemented with a linear complexity only and with quite
satisfactory results.
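
To make the quadratic-in-$\hat D$ cost concrete, here is a minimal sketch of a Bellman-type dynamic program over a grid of `k` cells (playing the role of the $\hat D$ pieces retained at the first stage). It is an illustration under stated assumptions, not the paper's implementation: the function name `bellman_segmentation`, the `cost(i, j)` interface and the squared-error contrast used in the example are all hypothetical.

```python
from math import inf


def bellman_segmentation(k, cost, d_max):
    """Best partitions of the grid cells 0..k-1 into d = 1..d_max segments.

    cost(i, j) is the (hypothetical) contrast of the segment made of the
    cells i..j-1. Returns {d: (minimal total cost, interior cut positions)}.
    The three nested loops take of the order of k**2 * d_max steps.
    """
    d_max = min(d_max, k)
    best = [[inf] * (k + 1) for _ in range(d_max + 1)]
    arg = [[0] * (k + 1) for _ in range(d_max + 1)]
    best[0][0] = 0.0
    for d in range(1, d_max + 1):
        for j in range(d, k + 1):
            for i in range(d - 1, j):
                value = best[d - 1][i] + cost(i, j)
                if value < best[d][j]:
                    best[d][j], arg[d][j] = value, i
    result = {}
    for d in range(1, d_max + 1):
        cuts, j = [], k
        for level in range(d, 1, -1):  # backtrack; the first segment starts at 0
            j = arg[level][j]
            cuts.append(j)
        result[d] = (best[d][k], cuts[::-1])
    return result


def squared_error(x):
    """Illustrative contrast: within-segment sum of squared deviations."""
    def cost(i, j):
        seg = x[i:j]
        mean = sum(seg) / len(seg)
        return sum((v - mean) ** 2 for v in seg)
    return cost


x = [0, 0, 0, 5, 5, 5, 1, 1]
paths = bellman_segmentation(len(x), squared_error(x), d_max=4)
pen = lambda d: 2.0 * d  # illustrative penalty, not the calibrated one
d_hat = min(paths, key=lambda d: paths[d][0] + pen(d))
```

Once the best cost is known for every dimension up to `d_max`, any penalty depending on the partition only through its dimension can be minimized in a final pass, which is the reason for tabulating all dimensions at once.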
5. Simulation study

In the previous sections, we were only interested in giving a form of penalty
yielding, in theory, an efficient estimator. The aim of this section is to study
practical choices of the penalty for each procedure. Several simulations allow
us to assess the relevance of these choices and to illustrate the qualities of each
procedure.

5.1. Choosing the penalty constant for the preliminary estimator 

We have examined the cases $r = 2$ and $r = 4$, with different values of $n = 2^N$.
For $r = 2$, the distribution $s$ is entirely determined by its first line, which is the
only one to be plotted, as a function of the parameter i, 1 < i < n (cf. Figured] 
for $s_a$ and Figure 5.2 for $s_b$ to $s_e$). For $r = 4$, examples $s_f$ to $s_h$ are plotted in
Figure 5.3. Part of our examples, $s_a$, $s_b$, $s_f$ and $s_g$, are piecewise constant. We
also extend our study to other examples of distributions having jumps, such as
$s_c$ and $s_h$, whose lines are piecewise affine. But the estimation capacities of $\tilde s$,
and not only its ability to detect change-points, deserve to be illustrated. So,
we also present smoother examples, if we may say so for functions of a discrete
parameter, such as $s_d$ or $s_e$.
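
For readers who wish to reproduce experiments of this kind, the following sketch simulates independent categorical variables from a piecewise constant distribution. It assumes that $s$ is stored column by column, column $i$ being the law of $Y_i$ on $\{1,\dots,r\}$; the function names, the change-point locations and the laws used in the example are hypothetical and are not the paper's examples $s_a$ to $s_h$.

```python
import random


def piecewise_constant_distribution(n, change_points, laws):
    """Columns of s: column i-1 is the law of Y_i on {1, ..., r}.

    change_points are the 1-based indices at which a new interval starts
    (1 excluded); laws gives one probability vector per interval.
    """
    if len(laws) != len(change_points) + 1:
        raise ValueError("one law per interval is required")
    starts = [1] + sorted(change_points)
    ends = starts[1:] + [n + 1]
    columns = []
    for start, end, law in zip(starts, ends, laws):
        columns.extend([tuple(law)] * (end - start))
    return columns


def simulate(columns, seed=0):
    """Draw independent Y_1, ..., Y_n, with Y_i distributed as column i-1."""
    rng = random.Random(seed)
    r = len(columns[0])
    return [rng.choices(range(1, r + 1), weights=col)[0] for col in columns]


# Hypothetical example with r = 2 and n = 2**10.
s = piecewise_constant_distribution(
    1024, [257, 769], [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4)]
)
y = simulate(s, seed=1)
```

Fixing the seed makes a Monte Carlo study repeatable; averaging a loss over many seeds gives the kind of estimated risk discussed in this section.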

As already said in Section 3.1, the estimator $\tilde s$ has been designed to satisfy
an oracle inequality, which it almost does according to Proposition 1. Therefore,
the risk of the oracle, i.e. $\inf_{m \in \mathcal M} \mathbb E_s[\|s - \hat s_m\|^2]$, serves as a benchmark
to judge the quality of $\tilde s$, and also the quality of a method for choosing
a penalty constant. We have studied two methods for choosing an adequate
penalty constant. The different quantities introduced in the sequel have been
estimated over 500 simulations. The first method aims at determining the value
of the constant $c_0$ that almost minimizes the risk of $\tilde s$, whatever $s$. Denoting by
$\tilde s(c)$ the preliminary estimator when $c_0$ takes the value $c$, we have estimated

\[ c^*(s) := \operatorname{argmin}_{c} \mathbb E_s\bigl[\|s - \tilde s(c)\|^2\bigr], \]

where, in practice, we have varied $c$ from 0 to 4, by step 0.1, and from 4 to 6 by
step 0.5. We report in Table 5.2 an estimation of $c^*$ and the ratio $Q^*$ between an
estimation of $\mathbb E_s[\|s - \tilde s(c^*)\|^2]$ and the estimated risk of the oracle. In view of
the results obtained here, we come to the following conclusions: taking $c_0 = 2$
seems reasonable when $r = 2$, but taking $c_0 = 2.5$ seems more appropriate when
$r = 4$. We give in Table 5.2 the ratio $Q_c$ between the estimated risk of $\tilde s(c)$ and
the estimated risk of the oracle, where $c = 2$ for $r = 2$ and $c = 2.5$ for $r = 4$.
Comparing $Q_c$ to $Q^*$ confirms that the choice of those values for $c_0$ is relevant.
Nevertheless, a good penalty should adapt to the unknown distribution $s$ to
estimate. That is why we have also tried a data-driven method, inspired from
results proved by Birg and Massart in a Gaussian framework (cf. (@)). That 
method has already been implemented in the same framework as ours in [13],
Section 8. Given a simulation of $(Y_1, \dots, Y_n)$, the procedure we have followed
can be decomposed in three steps:

Fig 5.2. First lines of $s$ (full line) and $\tilde s$ (dashed line), computed with a data-driven
penalty, for $s \in \{s_b, s_c, s_d, s_e\}$.

Fig 5.3. Four lines of $s$ (full line) and $\tilde s$ (dashed line), computed with a data-driven
penalty, for $s \in \{s_f, s_g, s_h\}$.

• determine the dimension $D(c)$ of the selected partition for each value $c$ of
the penalty constant $c_0$, where one varies $c$ from 0 to 3, by step 0.1;

• compute the difference between the dimensions of the selected partitions
for two consecutive values of $c_0$ and retain the value $\hat c$ corresponding to
the biggest jump in dimension under the constraint $D(\hat c) \le D_{max}$, where
$D_{max}$ is a prescribed maximal dimension;

• choose the constant $\hat c_j = 2\hat c$ to compute the preliminary estimator.
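
The three steps above can be sketched as follows. This is only an illustration under assumptions: `dimension_of` stands for a hypothetical routine returning the dimension of the partition selected with penalty constant `c`, and, of the two consecutive grid values surrounding the biggest jump, the sketch retains the larger one (the source does not say which of the two is kept).

```python
def calibrate_penalty_constant(dimension_of, d_max, c_grid=None):
    """Dimension-jump calibration: return twice the constant at the biggest jump.

    dimension_of(c) -> dimension of the partition selected with constant c
    (hypothetical callable). Only jumps landing on a dimension <= d_max count.
    """
    if c_grid is None:
        c_grid = [round(0.1 * k, 1) for k in range(31)]  # 0, 0.1, ..., 3
    dims = [dimension_of(c) for c in c_grid]
    best_jump, c_jump = -1, None
    for k in range(1, len(c_grid)):
        jump = dims[k - 1] - dims[k]          # drop in dimension between neighbours
        if dims[k] <= d_max and jump > best_jump:
            best_jump, c_jump = jump, c_grid[k]
    if c_jump is None:
        raise ValueError("no grid value satisfies the constraint D(c) <= d_max")
    return 2 * c_jump


# Toy dimension profile, for illustration only.
def toy_dimension(c):
    return 200 if c < 0.5 else (50 if c < 1.2 else 5)


c_hat_loose = calibrate_penalty_constant(toy_dimension, d_max=100)
c_hat_tight = calibrate_penalty_constant(toy_dimension, d_max=10)
```

With the toy profile, the constraint on the dimension changes which jump is retained, which shows why the choice of the maximal dimension matters in practice.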

We have taken $D_{max}$ of the order of $n/(\ln(n))^{a}$, with $a$ close to 2. That choice
is in fact inspired both by the method proposed in [20] and by a constraint
appearing in the theoretical results of [17] when using a penalized maximum
likelihood criterion (cf. Condition (2.17) in Theorem 2.3 of [17]). That choice
seems to yield good results, whatever $s$ or $n$. Here we have set $D_{max} = 30$ when
$N = 10$, $D_{max} = 100$ when $N = 12$ and $D_{max} = 175$ when $N = 13$. In order
to assess the performance of that second method, we give in Table 5.2 the ratio
$Q_j$ between the estimated risk of $\tilde s$ for that procedure and the estimated risk of
the oracle. We also give estimations of the mean value and the standard error
of $\hat c_j$, denoted respectively by $\bar c_j$ and $\sigma_j$.

Table 5.2

Performance of the preliminary estimator for different choices of the penalty constant.

s      r    N     c*     Q*     Q_c    c̄_j    σ_j    Q_j
s_a    2    10    1.7    2.4    2.6    2.2    0.3    2.7
s_b    2    10    1.7    1.9    1.9    2.6    0.4    2.1
s_c    2    10    1.8    1.8    1.8    2.5    0.4    1.8
s_d    2    13    2.2    1.5    1.6    2.2    0.1    1.6
s_e    2    13    2.2    1.7    1.8    2.2    0.1    1.8
s_f    4    10    2      1.4    1.5    3.3    0.6    1.7
s_g    4    12    2.5    1.3    1.4    2.5    0.1    1.3
s_h    4    10    2.6    1.3    1.3    2.7    0.2    1.3



Let us analyze the results of the simulations. In terms of risk, both methods
have in fact roughly the same performance. Nevertheless, the first one requires
calibrating a constant anew when changing the value of $r$, whereas the data-driven
method has the advantage of automatically adapting to the value of $r$. Therefore,
the latter should be recommended, and that is the one we have used to build
the estimators plotted in Figures 5.2 and 5.3. Let us now examine the values
of $Q^*$ (or $Q_c$, or $Q_j$) for the different examples. As foreseen by the oracle-type
inequality (3.1), the ratio between the risk of the preliminary estimator and that
of the oracle depends on $s$. In particular, the ratios $Q^*$, $Q_c$ or $Q_j$ reach their
highest value for $s_a$. It should be noted that the first line of this example takes
values very close to 1 on a large segment (cf. Figured]), a critical case according 
to the oracle-type inequality. However, for all examples studied here, the values 
of those ratios remain quite low, inferior or close to 2, except for s a . 
5.2. Choosing the penalty constants for the hybrid estimator

For the first stage of the hybrid procedure, the preliminary estimator has been
computed using the data-driven penalty. For the second stage, the practical
choice of an adequate penalty is more delicate, since the theoretical penalty
depends in this case on two constants and on the dimension $\hat D$ of the partition
selected during the first stage. We have first tried here the same method as
Lebarbier in [16], Chapter 7, for her own hybrid procedure. So we have assigned
to all partitions of $\{1,\dots,n/2\}$ into $D$ intervals the same penalty

\[ \mathrm{pen}_1^\circ(D) = \hat\beta_1 \bigl(2.5 + \ln(\hat D/D)\bigr) D, \]

where $\hat\beta_1$ is determined according to the same process as $\hat c_j$. That penalty is
proportional to the penalty calibrated by Lebarbier in [16] (Chapter 3). The latter
was in fact designed for the estimation of a regression function in a Gaussian
framework via model selection based on an exhaustive collection of partitions.
Anyway, the major drawback of such a method, as said at the end of Section 4, is
that we are only able to evaluate its worst-case computational complexity, of the
order of $O(n^3)$. So we have also tried to assign to all partitions of $\{1,\dots,n/2\}$
into $D$ intervals the penalty

\[ \mathrm{pen}_2^\circ(D) = \hat\beta_2 D, \]

where $\hat\beta_2$ is determined once again according to the same process as $\hat c_j$. Since
that penalty is a linear function of $D$, the hybrid procedure can be implemented
in that case with only $O(n)$ computations.

In order to draw a comparison between these procedures and with the
preliminary one, we give in Table 5.3 the following information for the distributions
$s_a$ to $s_c$ and $s_f$ to $s_h$, still computed over 500 simulations. We first recall the
dimension $D$ of the partition on which $s$ is built. Then we indicate the average
dimensions $\bar D_0$ and $\bar D_i$ of the partitions selected respectively by the preliminary
procedure, with a data-driven penalty, and the hybrid procedure with $\mathrm{pen}_i^\circ$, for
$i \in \{1,2\}$. We also give the average value $Q_{i:0}$ of the ratio between the estimated
risk of the hybrid estimator for $\mathrm{pen}_i^\circ$, for $i \in \{1,2\}$, and the estimated risk of
the preliminary estimator. Let us compare both ways to implement the hybrid
procedure. We observe that $Q_{2:0}$ is almost always of the same order as $Q_{1:0}$, and
even slightly lower in most cases. Therefore, taking into account the
computational complexity, we can only recommend using $\mathrm{pen}_2^\circ$. That is the choice we
have made for the hybrid estimators represented in Figures 5.4 and 5.5. Let us
now compare the hybrid procedure with the preliminary one for the examples
under study. First, the values of $\bar D_2$ and $\bar D_0$ indicate that, with the former, the
dimension of the selected partition is much closer to the true one. Moreover, the
figures show that the most significant change-points are still detected, are quite
close to the true ones, and that irrelevant change-points are much fewer with the
hybrid procedure. The only price to pay is an increase in risk, but only by a factor of
the order of 1.5.
Table 5.3

Comparison between the hybrid procedure, for different penalties, and the preliminary
procedure.

s      D    D̄_0     D̄_1    D̄_2    Q_{1:0}   Q_{2:0}
s_a    3    7.7     3.0    3.4    1.4       1.3
s_b    8    13.4    4.9    6.9    1.5       1.4
s_c    7    11.7    5.1    5.1    1.7       1.8
s_f    8    11.5    4.2    5.9    2.1       1.6
s_g    5    11.5    7.8    6.9    1.8       1.4
s_h    3    5.3     2.2    3.1    2.3       1.5




Fig 5.4. First lines of $s$ (full line) and of its hybrid estimator (dashed line) for
$s \in \{s_a, s_b, s_c\}$.

Fig 5.5. Four lines of $s_f$, $s_g$ and $s_h$ (full line) together with their hybrid estimators
(dashed line).
6. Proof of the approximation result

In this section, we prove Theorem 1 following the same path as DeVore and
Yu in [12]. We first describe the approximation algorithm on which that result
relies. Then, we give the main lines of the proof and also demonstrate the key
result, which is a direct consequence of the approximation algorithm. The proofs
of more technical points are postponed to the next subsections.



6.1. Approximation algorithm 

Let us fix $p \in (0,2]$, $\alpha > 1/p - 1/2$, $R > 0$ and $D \in \{1,\dots,n\}$. In order to prove
Theorem 1, we look for an upper bound for

\[ \inf_{m \in \mathcal M_D} \|t - \bar t_m\|^2 \]

uniformly over $t \in \mathcal B(\alpha,p,R)$. An element $t \in \mathcal M(r,n)$ being fixed, the adaptive
approximation algorithm presented by DeVore and Yu in [12] generates
partitions into dyadic intervals depending on $t$ such that the approximation
error over each interval of the partitions is lower than a prescribed threshold.
An adequate choice of that threshold is expected to yield a partition, depending
on $t$, that belongs to $\mathcal M_D$ and almost realizes the above infimum. In order to
describe precisely the algorithm and the way to use it for our approximation
problem, let us introduce some notation. Let $I$ be a dyadic interval of $\{1,\dots,n\}$.
The restriction of the norm $\|.\|$ to $I$ is denoted by $\|.\|_I$. Let $U$ be the linear
subspace of $\mathbb R^n$ generated by the vector $(1 \dots 1)$; we denote by $\mathcal E_2(t,I)$ the error
in approximating $t$ on $I$ by an element of $\mathbb R^r \otimes U$, i.e.

\[ \mathcal E_2(t,I) = \inf_{c \in \mathbb R^r \otimes U} \|t - c\|_I. \]

Besides, both intervals obtained by dividing $I$ into two intervals of same length
are called the children of $I$. The algorithm proceeds as follows. We fix a
threshold $\varepsilon > 0$. At the beginning, the set $\mathcal I^1(t,\varepsilon)$ contains $I_{(0,0)} = \{1,\dots,n\}$. If
$\mathcal E_2(t, I_{(0,0)}) \le \varepsilon$, then the algorithm stops. Else, $I_{(0,0)}$ is replaced in the
partition $\mathcal I^1(t,\varepsilon)$ with its children, hence a new partition $\mathcal I^2(t,\varepsilon)$ of $\{1,\dots,n\}$. In
the same way, the $k$-th step starts with a partition $\mathcal I^k(t,\varepsilon)$ of $\{1,\dots,n\}$ into
$k$ dyadic intervals. If $\sup_{I \in \mathcal I^k(t,\varepsilon)} \mathcal E_2(t,I) \le \varepsilon$, then the algorithm stops, else an
interval $I$ such that $\mathcal E_2(t,I) > \varepsilon$ is chosen in $\mathcal I^k(t,\varepsilon)$ and replaced with its
children, hence a new partition $\mathcal I^{k+1}(t,\varepsilon)$ of $\{1,\dots,n\}$ into $k+1$ dyadic intervals.
The algorithm finally stops, giving a partition $\mathcal I(t,\varepsilon)$. Denoting by $S(t,\varepsilon)$ the
linear space composed of the functions that are piecewise constant on $\mathcal I(t,\varepsilon)$,
the approximation $A(t,\varepsilon)$ of $t$ associated with this partition is defined as the
orthogonal projection of $t$ on $\mathbb R^r \otimes S(t,\varepsilon)$. So, the approximation error of $t$ by
$A(t,\varepsilon)$ satisfies

\[ \|t - A(t,\varepsilon)\|^2 = \sum_{I \in \mathcal I(t,\varepsilon)} \bigl(\mathcal E_2(t,I)\bigr)^2 \le |\mathcal I(t,\varepsilon)| \varepsilon^2. \]
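
The splitting rule described above is easy to prototype. The sketch below is a minimal illustration and not the authors' code: it assumes that $t$ is stored as a list of $r$ rows of length $n = 2^N$, that the norm is the Euclidean one (so that the best constant on an interval is the row-wise mean), and it uses 0-based half-open intervals `(start, stop)`; the names `local_error` and `adaptive_partition` are hypothetical.

```python
def local_error(t, start, stop):
    """E_2(t, I) for I = columns start..stop-1: distance to the best constant."""
    size = stop - start
    squared = 0.0
    for row in t:
        segment = row[start:stop]
        mean = sum(segment) / size
        squared += sum((x - mean) ** 2 for x in segment)
    return squared ** 0.5


def adaptive_partition(t, eps):
    """Split dyadic intervals until every local error is at most eps."""
    n = len(t[0])
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stack, leaves = [(0, n)], []
    while stack:
        start, stop = stack.pop()
        if stop - start > 1 and local_error(t, start, stop) > eps:
            mid = (start + stop) // 2
            stack.append((mid, stop))   # replace the interval with its children
            stack.append((start, mid))
        else:
            leaves.append((start, stop))
    return sorted(leaves)


# Toy example with r = 2 rows and one jump in the middle.
t = [[0.9] * 4 + [0.2] * 4, [0.1] * 4 + [0.8] * 4]
partition = adaptive_partition(t, eps=0.05)
total_squared_error = sum(local_error(t, a, b) ** 2 for a, b in partition)
```

The order in which the offending intervals are split does not change the final partition, since an interval is split if and only if its own local error exceeds the threshold.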
For any $\varepsilon > 0$ such that the algorithm stops at the latest at step $D$, the
approximation of $t$ that we get belongs to the collection $\{\mathbb R^r \otimes S_m\}_{m \in \mathcal M_D}$. Therefore

\[ \inf_{m \in \mathcal M_D} \|t - \bar t_m\|^2 \le |\mathcal I(t,\varepsilon)| \varepsilon^2. \]

Let us denote by $\mathcal E_D(t)$ the infimum of $|\mathcal I(t,\varepsilon)|\varepsilon^2$ taken over all $\varepsilon > 0$ satisfying
$|\mathcal I(t,\varepsilon)| \le D$. This is in fact the quantity that we shall bound, as indicated in
Theorem 4 below.

Theorem 4. Let $p \in (0,2]$, $\alpha > 1/p - 1/2$ and $R > 0$. For all $D \in \{1,\dots,n\}$
and $t \in \mathcal B(\alpha,p,R)$,

\[ \mathcal E_D(t) \le C(\alpha,p)\, n R^2 D^{-2\alpha}. \]

We then get Theorem 1 as a straightforward consequence of Theorem 4.



6.2. Proof of Theorem 4: the main lines

Here are the notions and notations that we will need along the proof. Let $p > 0$,
$\alpha > 0$ and $t \in \mathcal M(r,n)$. For every subset $I$ of $\{1,\dots,n\}$, let

\[ \mathcal E_p(t,I) = \inf_{c \in \mathbb R^r} \Bigl(\sum_{k \in I} \|t_k - c\|_r^p\Bigr)^{1/p}. \]

We define the vector $t^{\flat,\alpha,p}$ in $\mathbb R^n$ whose coordinates are

\[ t_i^{\flat,\alpha,p} = \sup_{I \ni i} |I|^{-(\alpha+1/p)} \mathcal E_p(t,I), \quad \text{for } i = 1,\dots,n, \]

where the supremum is taken over all the dyadic intervals $I$ of $\{1,\dots,n\}$ that
contain $i$. We denote by $\|.\|_{\ell_p}$ the (quasi-)norm defined on $\mathbb R^n$ by

\[ \|u\|_{\ell_p} = \Bigl(\sum_{i=1}^n |u_i|^p\Bigr)^{1/p} \]

(that is a norm only for $p \ge 1$) and by $\|.\|_{\ell_p,I}$ its restriction to a subset $I$ of
$\{1,\dots,n\}$. We define on $\mathbb R^n$ the discrete Hardy-Littlewood maximal function
$M_p$ by

\[ (M_p(u))_i = \sup_{I \ni i} |I|^{-1/p} \|u\|_{\ell_p,I}, \quad \text{for } i = 1,\dots,n, \]

where the supremum is taken over all the dyadic intervals $I$ of $\{1,\dots,n\}$
containing $i$. Last, we recall that every vector $u \in \mathbb R^n$ is identified with the function
defined on $\{1,\dots,n\}$ whose value in $i$ is $u_i$, for $1 \le i \le n$, hence the meaning
of notations such as $u \le v$ or $u^q$, where $u \in \mathbb R^n$, $v \in \mathbb R^n$ and $q > 0$.
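
A direct way to get a feel for the dyadic maximal function is to compute it. The sketch below assumes the reading $(M_p(u))_i = \sup_{I \ni i} (|I|^{-1} \sum_{k \in I} |u_k|^p)^{1/p}$ over dyadic intervals $I$, with $n$ a power of two; the function name `dyadic_maximal` is hypothetical.

```python
def dyadic_maximal(u, p=1.0):
    """Discrete dyadic maximal function M_p(u), coordinate by coordinate.

    For each i, takes the largest p-th power mean of |u| over the dyadic
    intervals containing i (lengths 1, 2, 4, ..., n). Cost: O(n log n).
    """
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    powered = [abs(x) ** p for x in u]
    best = powered[:]      # intervals of length 1
    sums = powered[:]
    length = 1
    while length < n:
        length *= 2
        # merge sibling intervals to get the sums over the next dyadic level
        sums = [sums[2 * k] + sums[2 * k + 1] for k in range(len(sums) // 2)]
        for k, total in enumerate(sums):
            mean = total / length
            for i in range(k * length, (k + 1) * length):
                if mean > best[i]:
                    best[i] = mean
    return [b ** (1.0 / p) for b in best]


m1 = dyadic_maximal([4, 0, 0, 0])  # [4.0, 2.0, 1.0, 1.0]

u = [1, 5, 0, 2, 7, 0, 0, 3]
q = 0.5
lhs = dyadic_maximal(u, q)
rhs = [m ** (1.0 / q) for m in dyadic_maximal([abs(x) ** q for x in u], 1.0)]
```

The last two lines compare $M_q(u)$ with $(M_1(u^q))^{1/q}$, the relation that lets a bound for $M_1$ be transferred to every $M_q$.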

The beginning of the proof directly results from the way the algorithm works.
A dimension $D$ being fixed, choosing $\varepsilon > 0$ as small as possible such that
the algorithm generates a partition with at most $D$ intervals leads to a first
comparison between the quantity $\mathcal E_D(t)$ and $D^{-2\alpha}$, without making use of any
particular hypothesis on $t$.

Proposition 2. Let $\alpha > 0$ and $p(\alpha) = (\alpha + 1/2)^{-1}$. For all $D \in \{1,\dots,n\}$ and
$t \in \mathcal M(r,n)$,

\[ \mathcal E_D(t) \le C(\alpha) \|t^{\flat,\alpha,2}\|^2_{\ell_{p(\alpha)}} D^{-2\alpha}. \]

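Proof. If $t^{\flat,\alpha,2} = 0$, then, whatever $\varepsilon > 0$, $\mathcal E_2(t, I_{(0,0)}) \le \varepsilon$, so $\mathcal E_D(t) = 0$, which
completes the proof in that case. Let us now suppose that $t^{\flat,\alpha,2}$ is non-null, and
let $\varepsilon > 0$. If $\mathcal E_2(t, I_{(0,0)}) \le \varepsilon$, then $|\mathcal I(t,\varepsilon)| = 1$. Else, let $I$ be a dyadic interval
that belongs to $\mathcal I(t,\varepsilon)$; then $I$ is a child of a dyadic interval $\tilde I$ such that

\[ \varepsilon < \mathcal E_2(t,\tilde I). \]

Using the definition of $t^{\flat,\alpha,2}$, we get, for all $i \in \tilde I$,

\[ \mathcal E_2(t,\tilde I) \le |\tilde I|^{\alpha+1/2}\, t_i^{\flat,\alpha,2}. \]

Since $I \subset \tilde I$, $|\tilde I| = 2|I|$ and $p(\alpha) = (\alpha + 1/2)^{-1}$, the last two inequalities lead,
for all $i \in I$, to

\[ \varepsilon < 2^{1/p(\alpha)} |I|^{1/p(\alpha)}\, t_i^{\flat,\alpha,2}, \]

hence, raising to the power $p(\alpha)$ and averaging over $i \in I$,

\[ \varepsilon^{p(\alpha)} \le 2 \|t^{\flat,\alpha,2}\|^{p(\alpha)}_{\ell_{p(\alpha)},I}. \]

Then we deduce by summing over all the intervals $I$ in the partition $\mathcal I(t,\varepsilon)$ that

\[ |\mathcal I(t,\varepsilon)| \le 2 \|t^{\flat,\alpha,2}\|^{p(\alpha)}_{\ell_{p(\alpha)}} \varepsilon^{-p(\alpha)}. \]

Whether $\mathcal E_2(t, I_{(0,0)}) \le \varepsilon$ or not, by choosing $\varepsilon = 2^{1/p(\alpha)} \|t^{\flat,\alpha,2}\|_{\ell_{p(\alpha)}} D^{-1/p(\alpha)}$,
we get a partition $\mathcal I(t,\varepsilon)$ that contains at most $D$ elements and satisfies

\[ |\mathcal I(t,\varepsilon)| \varepsilon^2 \le D^{1-2/p(\alpha)}\, 2^{2/p(\alpha)} \|t^{\flat,\alpha,2}\|^2_{\ell_{p(\alpha)}}. \]

As $p(\alpha) = (\alpha + 1/2)^{-1}$, we conclude that

\[ |\mathcal I(t,\varepsilon)| \varepsilon^2 \le 4^{\alpha+1/2} \|t^{\flat,\alpha,2}\|^2_{\ell_{p(\alpha)}} D^{-2\alpha}. \]

□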



The proof of Theorem 4 now relies upon three inequalities. The first one
allows us to draw a comparison between $\mathcal E_D(t)$ and $D^{-2\alpha}$ via a term that does
not depend on $t^{\flat,\alpha,2}$ anymore but on $t^{\flat,\alpha,p(\alpha)}$. It is the discrete analogue of a
particular case of Theorem 4.3 of [11].

Proposition 3. Let $\alpha > 0$ and $p(\alpha) = (\alpha + 1/2)^{-1}$. For all $t \in \mathcal M(r,n)$,

\[ t^{\flat,\alpha,2} \le C(\alpha)\, M_{p(\alpha)}\bigl(t^{\flat,\alpha,p(\alpha)}\bigr). \]
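From Propositions 2 and 3 we easily deduce that, for $\alpha > 0$, $p(\alpha) = (\alpha + 1/2)^{-1}$
and $D \in \{1,\dots,n\}$,

\[ \mathcal E_D(t) \le C(\alpha) \bigl\|M_{p(\alpha)}\bigl(t^{\flat,\alpha,p(\alpha)}\bigr)\bigr\|^2_{\ell_{p(\alpha)}} D^{-2\alpha}. \]

Let us now fix $p \in (0,2]$. By Jensen's inequality, we have

\[ \bigl\|M_{p(\alpha)}\bigl(t^{\flat,\alpha,p(\alpha)}\bigr)\bigr\|_{\ell_{p(\alpha)}} \le n^{1/p(\alpha) - 1/p} \bigl\|M_{p(\alpha)}\bigl(t^{\flat,\alpha,p(\alpha)}\bigr)\bigr\|_{\ell_p} \]

and

\[ t^{\flat,\alpha,p(\alpha)} \le t^{\flat,\alpha,p}, \]

hence

\[ \mathcal E_D(t) \le C(\alpha)\, n^{2(\alpha + 1/2 - 1/p)} \bigl\|M_{p(\alpha)}\bigl(t^{\flat,\alpha,p}\bigr)\bigr\|^2_{\ell_p} D^{-2\alpha}. \]
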

Though the most obvious comparison between a vector $u$ and any of its
maximal functions is that the latter are greater than the first, the following maximal
inequality also ensures a control of $u$ over its maximal functions (cf.
inequality (6.12) below). That inequality is in fact the discrete version of a fundamental
result in functional analysis, namely the Hardy-Littlewood maximal inequality,
that may be found in [2] (Theorem 3.10) for instance.

Proposition 4. Let $q > 1$. For all $u \in \mathbb R^n$,

\[ \|M_1(u)\|_{\ell_q} \le C(q) \|u\|_{\ell_q}. \]

Since the maximal function $M_q$, $q > 0$, is related to $M_1$ by the property

\[ M_q(u) = \bigl(M_1(u^q)\bigr)^{1/q}, \quad \text{for all } u \in \mathbb R^n, \]

Proposition 4 yields, for all $r > q > 0$ and $u \in \mathbb R^n$,

\[ \|M_q(u)\|_{\ell_r} \le C(q,r) \|u\|_{\ell_r}. \tag{6.12} \]
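
For completeness, the passage from Proposition 4 to Inequality (6.12) can be written out as follows. This is a sketch which assumes that Proposition 4 reads $\|M_1(u)\|_{\ell_q} \le C(q)\|u\|_{\ell_q}$ for $q > 1$ (as the end of its proof in Section 6.3.2 indicates) and that $u^q$ denotes the coordinatewise power of $|u|$; the exponent $r/q > 1$ is what makes Proposition 4 applicable.

```latex
\begin{align*}
\|M_q(u)\|_{\ell_r}
  &= \bigl\| \bigl(M_1(|u|^q)\bigr)^{1/q} \bigr\|_{\ell_r}
   = \bigl\| M_1(|u|^q) \bigr\|_{\ell_{r/q}}^{1/q} \\
  &\le \bigl( C(r/q)\, \bigl\| |u|^q \bigr\|_{\ell_{r/q}} \bigr)^{1/q}
   = C(r/q)^{1/q}\, \|u\|_{\ell_r}.
\end{align*}
```

The two equalities only use the identity $\|w^{1/q}\|_{\ell_r} = \|w\|_{\ell_{r/q}}^{1/q}$ for a vector $w$ with non-negative coordinates.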

Thus, when applied with $u = t^{\flat,\alpha,p}$, $r = p$ and $q = p(\alpha)$, this inequality leads to

\[ \mathcal E_D(t) \le C(\alpha,p)\, n^{2(\alpha + 1/2 - 1/p)} \|t^{\flat,\alpha,p}\|^2_{\ell_p} D^{-2\alpha}. \]

Last, Proposition 5 below provides the adequate control of the $\ell_p$-(quasi-)norm
of $t^{\flat,\alpha,p}$ by the size of the wavelet coefficients of $t$ and allows us to complete
immediately the proof of Theorem 4.

Proposition 5. Let p € (0, 2] and a > l/p — 1/2. For all t £ ^#(r, n), 

i/p 



\t la ' p \\i p < C{a 1 p)n- {a+1/2 - 1/p) 



V j=Q A€A(j) / 



where, for all X £ A, (3\ stands for the column vector ofM 7 ' whose l-th line is 



/# =(tC0,^ A ) Tl ,/ or j = i,... |r . 
6.3. Proofs of Propositions 3 and 4

We present in the same section the proofs of Propositions 3 and 4, which both
mainly call for the notion of decreasing rearrangement of a vector in $\mathbb R^n$.

Definition 3. Let $u \in \mathbb R^n$. The decreasing rearrangement of $u$ is the $\mathbb R^n$-vector
denoted by $u^*$ satisfying

\[ u_1^* \ge u_2^* \ge \dots \ge u_n^* \quad \text{and} \quad \{u_i^*;\ 1 \le i \le n\} = \{|u_i|;\ 1 \le i \le n\}. \]

We will also make use of the Lorentz (quasi-)norms on $\mathbb R^n$ in the proof of
Proposition 3, whose definition we recall here.

Definition 4. Let $0 < p < +\infty$ and $0 < q \le +\infty$. We denote by $\|.\|_{\ell_{p,q}}$ the
Lorentz (quasi-)norm defined on $\mathbb R^n$ by:

• if $q$ is finite, $\|u\|_{\ell_{p,q}} = \bigl(\sum_{i=1}^n i^{-1} (i^{1/p} u_i^*)^q\bigr)^{1/q}$;

• if $q = +\infty$, $\|u\|_{\ell_{p,\infty}} = \sup_{1 \le i \le n} i^{1/p} u_i^*$.

For all subset $I$ of $\{1,\dots,n\}$, we denote by $\|.\|_{\ell_{p,q},I}$ the restriction of $\|.\|_{\ell_{p,q}}$ to
$I$. In particular, notice that, for all $u \in \mathbb R^n$, $0 < p < +\infty$ and $0 < q \le +\infty$,

\[ \|u\|_{\ell_{p,p}} = \|u\|_{\ell_p} \quad \text{and} \quad \|u^*\|_{\ell_{p,q}} = \|u\|_{\ell_{p,q}}. \]

The reader may find in the appendix other useful properties relative to these
notions.
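
These two notions are straightforward to compute, which can help when checking small cases by hand. The sketch below assumes the reading $\|u\|_{\ell_{p,q}} = (\sum_i i^{-1}(i^{1/p}u_i^*)^q)^{1/q}$ for finite $q$ and $\sup_i i^{1/p}u_i^*$ for $q = +\infty$; the function names are hypothetical.

```python
from math import inf


def decreasing_rearrangement(u):
    """u*: the absolute values of the coordinates of u, sorted in decreasing order."""
    return sorted((abs(x) for x in u), reverse=True)


def lorentz_norm(u, p, q):
    """Lorentz (quasi-)norm of the non-empty vector u, for 0 < p < inf, 0 < q <= inf."""
    star = decreasing_rearrangement(u)
    if q == inf:
        return max(i ** (1.0 / p) * x for i, x in enumerate(star, start=1))
    total = sum((i ** (1.0 / p) * x) ** q / i for i, x in enumerate(star, start=1))
    return total ** (1.0 / q)


u = [3, -1, 2, 0]
u_star = decreasing_rearrangement(u)   # [3, 2, 1, 0]
norm_22 = lorentz_norm(u, 2, 2)        # should match the Euclidean norm of u
weak_1 = lorentz_norm(u, 1, inf)       # sup_i i * u*_i
```

With $q = p$ the weights $i^{q/p - 1}$ are all equal to 1, which is why the value computed for `norm_22` agrees with the usual $\ell_2$ norm.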



6.3.1. Proof of Proposition 3

The proof of Proposition 3 mostly relies on a lemma that we demonstrate in
this paragraph, after introducing some notation. Let $I$ be a dyadic interval of
$\{1,\dots,n\}$, $t \in \mathcal M(r,n)$, and $p > 0$. By a compactness argument, there exists
at least one vector in $\mathbb R^r$, denoted by $v_p(t,I)$, realizing the error $\mathcal E_p(t,I)$, i.e.
satisfying

\[ \mathcal E_p(t,I) = \Bigl(\sum_{k \in I} \|t_k - v_p(t,I)\|_r^p\Bigr)^{1/p}. \]

We define the vectors $u_p(t,I)$ and $t^{\flat,\alpha,p,I}$ in $\mathbb R^n$ whose coordinates are null
outside of $I$ and given otherwise respectively by

\[ (u_p(t,I))_i = \|t_i - v_p(t,I)\|_r, \quad \text{for } i \in I, \]

and

\[ t_i^{\flat,\alpha,p,I} = \sup_{J} |J|^{-(\alpha+1/p)} \mathcal E_p(t,J), \quad \text{for } i \in I, \]

where the supremum is taken over all the dyadic intervals $J$ of $\{1,\dots,n\}$ that
are contained in $I$ and contain $i$.
Lemma 1. Let $\alpha > 0$, $p > 0$ and $t \in \mathcal M(r,n)$. Let $I$ be a dyadic interval of
$\{1,\dots,n\}$ containing at least two elements. For all $j \in \{1,\dots,|I|/2\}$,

M^));<C(a,p)(|^ 

Proof. We fix j 6 {1, . . . , |/|/2}. Let E be the set composed of all the indices i 
in {1, ... satisfying (t^^ 1 ), > (t^ 01 ^ 1 )*. As |J5| < j - 1, we only have to 
prove that 

/m/2 \ 

(«p(*.-0)i < C(«,P)^ £ fe a - 1 (** ia ' P, ' r )fe +i a (*»* aiP ' J );J (6-13) 

for all the indices i € {1, . . . , n}, except maybe for those belonging to E. Con- 
sider is {1, . . . ,n} such that i ^ E. li i ^ I, then (itp(t, i")) . = 0, so Inequal- 
ity (|6 . 1 3[) is trivial. Suppose now that i £ L and i (£ E, and let {Ii}i<i< m be the 
sequence of dyadic intervals defined by 

Li = /, /ji+i is the child of containing i, and / m = {i}, 

where m > 2 because |/| > 2. Notice that, for all Z € {0, . . . , m — 1}, = 
2 _i |/|. Let q be the strictly positive integer such that 

2 -(<?+i)|/| < j < 2 -«|J|. 

Such a definition implies, in particular, that 2 _9 |/| > 1, so that q < m. From 
the triangular inequality, 

q m 
(u p {t,I)).<^2\\v p (t,I l - 1 )-V p {t,I l )\\ r + £ ||«p(*,/j-l)-«p(t,/,)||r, (6.14) 
1=2 l=q+l 

with the convention that the first sum in Inequality (|6.14p is null for q — 1. Let 
us fix / € {2, ... , m} and determine an upper-bound for the term \\v p {t, Ii-i) — 
v p (t,Ii)\\ r . We recall that C J;_i and |/f-i| = 2|//J. Besides, for all p > 0, 
the (quasi-)norm ||.||^ satisfies a triangular inequality within a multiplicative 
constant C(p), where we can take C(p) — 1 for p > 1, and C(p) = 2 1/p for 
< p < 1. Therefore, we get 

|| Wj ,(t, /,_!) - Wp (t, I,)|| P < CipMr^fSpit, J,_ X ) + f p (t, J,)) , 

which leads to 

||« p (t,J|_i) - w P (t,/i)||r <C(o,p)|Ji| 0, mint![' a ' , '" r . (6.15) 

fee/; 

Let us bound the first sum appearing in ()6.14|) . For all I 6 {2, . . . , m}, we have 



keli k V 7 1 7 "' I l<fc<|7; 
and, as $|I_{l+1}| = |I_l|/2$,

r\h\ I* 1 

|/,| Q = C(a)/ x a ~ x dx < C{a) ^ k°- x . 
Jlh+l1 k=\i l+1 \ 

Consequently, when q > 2, Inequality (|6.15|) yields 

q i \h\ 

X)K(t,/i-i)-«p(t,/Ollr<C(a,p)X; E 

(=2 J=2fc=|7 I+1 | 

1*1/2 

<c(a, P )E* a ~ 1 (* ll,a ' p '')I- 

Regarding the second sum appearing in (|6.14p . we now use Inequality (|6.15p 
combined with the following remarks. For all I such that q + 1 < I < m, we have 



,a,p,I ^ 4 <i,a,p,I 

Therefore, 



minfcg/, t^ a,p ' < t\ ,a,p ' , since 7/ contains i, and we recall that |//| = 2 ^ 



E ii Vp (t,/ I _ 1 )-« p (t ) j I )iir<c(a,f.)ijr(t»-*- r ) < E 2 ^ 1)Q - 

l=q+l l=q+l 

Furthermore, remember that 2 - ( 9+1 )|/| < j and i ^ E, so we finally obtain 
E ||«p(t,/i-i)-« P (t > /j)||r<C(a,p)r(t ,l ' a,, "- r )J- 

Z=g+1 

We have thus proved inequality (|6 . 13[) and Lemma [TJ □ 

We are now able to prove Proposition [3l Let a > 0, p(ot) — (a + l/2)~ 1 and 
t £ ^(r, n). We fix j 6 {1, . . . , n}. From the definition of £%{t, I) for any subset 
I of {1, ... , n}, and due to the fact that £z(t, {«}) = 0, we have 

^ < S n V \I\-y^u p{a) {t,I)\\ t2 , 

I3i 

where the supremum is taken over all the dyadic intervals I of {1, . . . , n} that 
contain i, except for {i}. We fix such an interval I. The sequence { (u p ( a ) (t, I)) } 1<J<? 
decreases and is null for j > \ I\ + 1, hence 

1*1/2 x2 

|| u p(a)(*)-0||^ < 2 E ((%>(«)(*> T )Tj 
3=1 

From Lemma [1] and the definition of p(a), we get 

• 1*1/2 / 1*1/2 



/ 1*1/2 / |/|/2 W 

V i=i V fc=j / 



*i|2 
Using one of Hardy's inequalities (cf. Proposition 8 in the Appendix) and noticing
that $t^{\flat,\alpha,p(\alpha),I} \le t^{\flat,\alpha,p(\alpha)}$, we are led to

\\u p(a) (t,I)\\ l2 <C(a)\\t^^\\ lp{a)2j . 

Last, since p(a) < 2, we deduce from classical inequalities between Lorentz 
(quasi-)norms (cf. Proposition [7] in the Appendix) 

i^ a ' 2 < C(q) sup |/|-V^(«)|| t tl»«, P («)|| £ 

I3i 

where the supremum is taken over all the dyadic intervals I of {1, . . . , n} that 
contain i, which completes the proof of Proposition [3] 

6.3.2. Proof of Proposition 4

Let $q > 1$ and $u \in \mathbb R^n$. As $M_1(u) = M_1(|u|)$, we can suppose that $u$ has positive
or null coordinates. Let us first demonstrate that, for all $i \in \{1,\dots,n\}$,

(MiMtfWi-^rA (6.16) 

If i = 1, then this inequality easily follows from the definitions of (Mi(u))* and 
u*. Let us now fix i € {2, ... , n}. We can write «asu = ii + m, where v and w 
are the K n -vectors whose respective coordinates are 

Ufc = maxjwfc — u* , 0} and = minjufc, u*}, for fc = 1, . . . , n. 

From the triangular inequality, we deduce that M\{u) < Mi(v) + Mi(w). Propo- 
sition [6] (cf . Appendix) then leads to 

(M!(«))? < (M 1 (v)y ii/2 - ] + {M 1 (w))l m . 

Moreover, 

(Mi(w))l l/2i < HMiHIl^ < |MU., 
and, from Proposition [6] again, 

(Afx^))^ < 2r 1 ||t;|Ui. 

Consequently, 

(MxWJ^C^IIwll^ + HHU-). (6-17) 

Let / be the set of all the indices I, 1 < I < n, such that ui > u*. From the 
definitions of v and w, we get 

m 

M\e + »imu- < X)«fc + (* - i j ix = 

k=l k=l 
which, given Inequality (6.17), completes the proof of (6.16). We now have

\Wi(u)ni<c{q)Y, < _1 E u * ■ (° 8 ) 

i=l V k=l I 

Let us denote by q' the conjugate exponent of q, and write, for all k in {1, . . . , n}, 
u* k = k~ 1 / qq k x l qq u* k . We deduce from Hlder's inequality 

n ( i \ q n / \ / t \ 

E * _1 E< ^ E <^ 1/9 k 1 E fc w k) 9 • 

i=l \ fc=l / i=l \ / \ fc=l / 

Interchanging the order of the summations, we obtain 

n / i \ ? n 

E rl E^ <c(9)£(«*) 



i=i \ k=i / fe=i 



Consequently, 

||(Mi(tt))*|| <a < C{q)\\u*\\ lq , 

hence Proposition 2J 

6.4. Proof of Proposition 5

Let p e (0, 2], a > l/p - 1/2 and i e ^#(r, n). For alH 6 {1, . . . , n} and all 
< J < AT, we denote by I(J, i) the only dyadic interval of length n1~ 3 that is 
contained in {1, . . . , n} and contains i. From the definition of i*' Q,p , we deduce 

N-l n 

\\t^% p < E(^ l2-/ ) QP+1 E (£ P (t,HJ,i))) P - (6-19) 

J=0 i=l 

Let us first suppose that < p < 1. From the definition of E p (t, I(J,i)), we 
have 

(£ p (t,/(J,i))) P < J] ||t fc -fill* 

For all — 1 < j < iV — 1, the functions {0a}agA(j) are constant over any dyadic 
interval of length n2~U +1 >. Therefore, if k belongs to I(J, i), then 

N-i 

tk ~ li = E E ^(0Afe-0Ai)- 

j=J AsA(j') 

As < f> < 1, we deduce from the classical inequality between £ p -quasi-norm 
and £i-norm 

n N—l 

Y i (£ P (tJ(J^))) P <2n^2-' J2 2M1/2 ' 1/P) E \W 

i=l jW AeA(j) 
Interchanging the order of the summations, we get

N-l 



3=0 AeA(j) 



AN?- 



Let us now consider the case 1 < p < 2. We fixO<J<7V — 1 and define 

AT-l 

T ( J ) = E E 

j=J AeA(j) 

As t — T(J) is constant over any dyadic interval of length n2~~ J , 

£ p (t,I(J,i)) =£ p (T(J),I(J,i)). 
This equality and the definition of £ p (T(J),I(J, z)) lead to 

n n 

EfcMwo)) <E E \\(T(J))J r 

i=l i=l keI(J,i) 

n , N-l 

^ElE E Hfl 

k=l V j=J AeA(j) 
From (|6.19[) and this last inequality, we get 

n N-l , N-l p 

ll^ P H? p <^ QP EE 2 J "E E H/3A||r|^ fc | . 

fc=l J=o \ j=J AeA(j) ' 

Then, using one of Hardy's inequalities (cf. Proposition [5] in the Appendix) and 
remembering that, for all j S {—1, . . . ,N — 1}, the functions {^aIagACj) have 
disjoint supports, we conclude that 

N-l n 

¥' a ' p \\ p i p < C(a,p)n-°* 2jap E WM^M 9 , 

J=0 AeA(j) fc=i 

hence Proposition [5l 



p 

\\\r\<P\ fcl 
Appendix A: Some useful inequalities

We state here, for vectors in $\mathbb R^n$, a few inequalities that are similar to classical
inequalities for functions of a continuous parameter. The proofs of the latter,
which may be found in [2], for instance, are easy to transpose to the
finite-dimensional case.

Proposition 6 (Some properties of decreasing rearrangements). Let $u$ and $v$
be two vectors in $\mathbb R^n$. For all $\lambda > 0$, let $I_u(\lambda)$ be the set of the indices $k$ in
$\{1,\dots,n\}$ such that $|u_k| > \lambda$.

1) For all $i \in \{1,\dots,n\}$, $u_i^* = \sup\{\lambda \ge 0 \text{ s.t. } |I_u(\lambda)| \ge i\}$.

2) If, for all $i \in \{1,\dots,n\}$, $u_i \le v_i$, then, for all $i \in \{1,\dots,n\}$, $u_i^* \le v_i^*$.

3) For all $i, j \in \{1,\dots,n\}$ such that $1 \le i + j \le n$, $(u+v)^*_{i+j} \le u_i^* + v_j^*$.

4) For all $i \in \{1,\dots,n\}$, $(M_1(u))^*_i \le i^{-1} \|u\|_{\ell_1}$.

Proof. See, for instance, [2], Proposition 1.7 and Theorem 3.3. □

Proposition 7 (Inequalities between Lorentz (quasi-)norms). Let $p$, $q$ and $q'$
be positive reals and let $u$ be a vector in $\mathbb R^n$.

1) If $p < q$, then $\|u\|_{\ell_{p,\infty}} \le C(p,q) \|u\|_{\ell_{p,q}}$.

2) If $q' < q$, then $\|u\|_{\ell_{p,q}} \le C(p,q,q') \|u\|_{\ell_{p,q'}}$.

Proof. See, for instance, [2], Proposition 4.2. □

Proposition 8 (Hardy's inequalities). Let $q > 1$ and let $\psi$ be a vector
whose coordinates are non-negative.

1) For all $\lambda < 1$,

2) For all $\alpha > 0$,

Proof. See, for instance, [2], Lemma 3.9. □